[GPU] Fix OOB memory access in gemm_tiled_opt kernel for non-aligned tile dimensions (#34482)
### Description of the issue (symptom, root-cause, how it was resolved)
- The original gemm_tiled_opt kernel uses BLOCK_READ_B (sub-group block
reads) to load B matrix tiles, and each such read always fetches
SIMD_WIDTH × B_VEC_SIZE contiguous elements. When the N dimension is not
evenly divisible by the tile size (TILE_N), the last tile group along N
reads past the end of the allocated buffer. This out-of-bounds memory
access surfaces as CL_OUT_OF_RESOURCES. The same issue applies to
BLOCK_READ_A in the static K-leftover path when K is not aligned to
TILE_K.
- Add boundary checks for BLOCK_READ operations in gemm_tiled_opt.cl to
prevent CL_OUT_OF_RESOURCES errors when matrix dimensions are not
aligned to tile sizes.
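
To make the overrun concrete, here is a minimal host-side C sketch, not the kernel code itself. It assumes a hypothetical `TILE_N` of 16 (the real value is chosen at kernel build time and is not stated here), and the helper names `tile_overrun` and `load_b_tile_guarded` are made up for illustration. It shows how many elements an unconditional tile read would fetch past the end of a row, and one possible shape of a guarded load that zero-fills the out-of-range part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile width; the real TILE_N is set when the kernel is built. */
#define TILE_N 16

/* Number of elements the last N tile would read past the end of a row of
   length n if every tile load unconditionally fetched tile_n elements. */
static size_t tile_overrun(size_t n, size_t tile_n) {
    assert(tile_n > 0);
    size_t rem = n % tile_n;
    return rem == 0 ? 0 : tile_n - rem;
}

/* Guarded load (illustrative): copy only the in-bounds part of the tile that
   starts at tile_n_offset and zero-fill the rest of out[0..TILE_N).
   Returns the number of elements actually read from b_row. */
static size_t load_b_tile_guarded(const float *b_row, size_t n,
                                  size_t tile_n_offset, float *out) {
    size_t valid = 0;
    if (tile_n_offset < n) {
        valid = n - tile_n_offset;
        if (valid > TILE_N)
            valid = TILE_N;
    }
    for (size_t i = 0; i < TILE_N; ++i)
        out[i] = (i < valid) ? b_row[tile_n_offset + i] : 0.0f;
    return valid;
}
```

With N=199 and the assumed TILE_N of 16, the last tile starts at offset 192 and has only 7 valid columns, so an unguarded read would touch 9 elements beyond the row.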
Changes:
- Add tile_n_offset bounds check before BLOCK_READ_B in dynamic and
static paths (main loop and K-leftover sections)
- Add K dimension bounds check before BLOCK_READ_A in static K-leftover
section
- Guard static path checks with #if TILE_N_NOT_DIVISIBLE to ensure zero
overhead for tile-aligned shapes
- Add a regression test for a real model shape (MatMul_147904: M=128,
K=1025, N=199, batch=32)
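
The zero-overhead claim for the `#if TILE_N_NOT_DIVISIBLE` guard can be pictured with a plain C analogue. This is only a sketch: the macro is defined by hand here (how the real kernel build sets it is not shown in this description), and `read_b_elem` is a hypothetical helper, not a function from gemm_tiled_opt.cl. When the macro is 0, the preprocessor removes the bounds check entirely, so tile-aligned shapes compile to the unguarded read.

```c
#include <stddef.h>

/* Assumption: defined by hand for this sketch. */
#ifndef TILE_N_NOT_DIVISIBLE
#define TILE_N_NOT_DIVISIBLE 1
#endif

/* Hypothetical helper: read element idx of a row holding n valid elements. */
static float read_b_elem(const float *b_row, size_t n, size_t idx) {
#if TILE_N_NOT_DIVISIBLE
    /* Compiled in only for shapes where N is not a multiple of TILE_N. */
    if (idx >= n)
        return 0.0f;
#else
    (void)n; /* aligned shapes: no check, no extra instructions */
#endif
    return b_row[idx];
}
```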
#### The code and line that caused this issue (if it is not changed directly)
- src/plugins/intel_gpu/src/kernel_selector/cl_kernels/gemm_tiled_opt.cl
#### Reproduction step and snapshot (if applicable. Do not attach for customer model)
- $ ./benchmark_app -d GPU -m ~/cvs173214/emb.xml -hint none -nstreams 1 -nireq 1 -niter 1 -infer_precision f32
#### Problematic graph
-
<img width="617" height="456" alt="image"
src="https://github.com/user-attachments/assets/ad952234-be65-4b2b-afc0-89687e678f78"
/>
#### Checklist
- [x] Is it a proper fix? (not a workaround)
- [x] Did you include a test case for this fix, if necessary?
- [x] Did you review existing tests that can be extended to cover this
scenario? Which tests did you review?
### Tickets:
- 173214