[GPU] Fix fully_connected_gpu_gemv compilation on modern NEO drivers (Gen 9.5+) (#35661)
### Details
OpenCL compilation of `fully_connected_gpu_gemv.cl` fails on Intel
Compute Runtime (NEO) **23.x and newer** with:
```
error: no matching function for call to 'intel_sub_group_block_read'
```
Seven call sites in the kernel pass `__local uint *` to
`intel_sub_group_block_read`. Per the [`cl_intel_subgroups` extension
spec](https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroups.html),
this builtin is only defined for `__global` pointers. Older NEO releases
carried an unofficial `__local` overload that has since been dropped; it
was never part of the `cl_intel_subgroups` contract.
#### Symptom
- Hardware: expected on any Intel Gen ≥ 9.5 GPU (UHD 630 / UHD P630 /
Iris Xe / Arc); reproduced on UHD P630.
- Driver: confirmed broken on Intel Compute Runtime (NEO)
**23.43.27642.18**, **24.05.28454.6**, and tip of
[intel/compute-runtime](https://github.com/intel/compute-runtime).
- Models impacted: any model that hits the int4 weight-only FC path
(Qwen3, Llama, Mistral, Phi, etc., when run via the GPU plugin).
- Failure mode: `LOAD_NETWORK` aborts when `clBuildProgram` fails with
the OpenCL compiler error above; the model never loads.
- The 2024.6 release uses the working `__global` block-read path and is
unaffected — this is a regression introduced during the 2025 dev cycle
in the int4 fc gemv rewrite.
#### Root cause
The only pointer form of `intel_sub_group_block_read` defined by the
`cl_intel_subgroups` extension takes `const __global uint*`. Older NEO
releases shipped an additional `__local`-pointer overload in their
built-in headers; that overload has since been removed and was never
part of the `cl_intel_subgroups` contract. The seven affected lines in
`fully_connected_gpu_gemv.cl` were written against that unofficial
overload:
| line | kernel layout | array |
| --- | --- | --- |
| 226 | `OS_IS_YX_OSV16` | `all_sum_even` |
| 370 | `OS_IS_YX_OSV32_ISV2` | `all_sum_even` |
| 371 | `OS_IS_YX_OSV32_ISV2` | `all_sum_odd` |
| 566 | `OS_IS_YX_OSV64_ISV2` | `all_sum_0` |
| 567 | `OS_IS_YX_OSV64_ISV2` | `all_sum_1` |
| 568 | `OS_IS_YX_OSV64_ISV2` | `all_sum_2` |
| 569 | `OS_IS_YX_OSV64_ISV2` | `all_sum_3` |
#### Fix
Each `intel_sub_group_block_read((const __local uint*)p)` call is
replaced with `((const __local uint*)p)[get_sub_group_local_id()]`.
This is semantically equivalent because:
1. The kernel forces a 16-wide sub-group via
`__attribute__((intel_reqd_sub_group_size(SUBGROUP_SIZE)))` (line 98),
with `SUBGROUP_SIZE == 16`.
2. The source arrays are declared `__local float[16][16]` and indexed
`[wi_id][thr_id]`, where `wi_id == get_sub_group_local_id()`.
3. `intel_sub_group_block_read` on a 16-wide sub-group reads 16
consecutive uints from the pointer and distributes them across the 16
lanes — lane `i` receives word `i`. The per-lane scalar load
`p[get_sub_group_local_id()]` produces the same per-lane value.
4. A `barrier(CLK_LOCAL_MEM_FENCE)` already sits between the writes to
the `__local` arrays and these reads (lines 223, 367, 563), so the read
is correctly ordered.
5. The result is fed into `sub_group_reduce_add` immediately afterwards
— that consumer is unchanged.
The replacement should compile on any NEO version: it needs no
block-read builtin, just a standard `__local` pointer dereference
indexed by `get_sub_group_local_id()`, and the kernel already relies on
sub-group builtins such as `sub_group_reduce_add`.
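The lane mapping argued in points 1 to 3 can be sketched on the host. This is plain C, not OpenCL: `emulated_block_read`, `per_lane_load`, and `count_mismatches` are hypothetical helpers that only model the behaviour described above (lane `i` receives word `i`), and the 16×16 array stands in for one of the `all_sum_*` buffers. It illustrates the reasoning; it does not replace running the kernel on a device.

```c
#include <stdint.h>

#define SUBGROUP_SIZE 16

/* Hypothetical model of a block read on a 16-wide sub-group: one call
   serves all lanes, and lane i receives the i-th consecutive word at p. */
static void emulated_block_read(const uint32_t *p, uint32_t out[SUBGROUP_SIZE]) {
    for (int lane = 0; lane < SUBGROUP_SIZE; ++lane) {
        out[lane] = p[lane];
    }
}

/* Hypothetical model of the replacement: each lane performs its own
   scalar load, indexed by its sub-group local id. */
static uint32_t per_lane_load(const uint32_t *p, int sub_group_local_id) {
    return p[sub_group_local_id];
}

/* Compares both access patterns over every row of a 16x16 buffer and
   returns the number of lanes whose values disagree. */
int count_mismatches(void) {
    uint32_t local_mem[SUBGROUP_SIZE][SUBGROUP_SIZE];
    int mismatches = 0;

    /* Fill the stand-in buffer with distinct values. */
    for (int row = 0; row < SUBGROUP_SIZE; ++row) {
        for (int col = 0; col < SUBGROUP_SIZE; ++col) {
            local_mem[row][col] = 1000u * (uint32_t)row + (uint32_t)col;
        }
    }

    for (int row = 0; row < SUBGROUP_SIZE; ++row) {
        uint32_t block[SUBGROUP_SIZE];
        emulated_block_read(local_mem[row], block);
        for (int lane = 0; lane < SUBGROUP_SIZE; ++lane) {
            if (block[lane] != per_lane_load(local_mem[row], lane)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Under this model `count_mismatches()` returns 0, which is the per-lane equality the fix relies on.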
#### Tickets
No prior issue or PR found mentioning `intel_sub_group_block_read` in
this repo as of master @ 2026-05-04. Six related fc-gemv PRs (#28976,
#31486, #31806, #31477, #32710 / #32735, #32749) touch the same kernel
for unrelated reasons and don't address the `__local` block-read issue.
#### Testing
- Tested on Intel UHD P630 (Gen 9.5) + NEO 23.43.27642.18 + IGC
1.0.15468.11. Each replacement string is exactly the same length in
bytes as the original, so I could verify the patch by
binary-string-replacing them inline in the prebuilt
`libopenvino_intel_gpu_plugin.so` from `openvino==2025.4.1`. The
resulting kernel source should match what a clean source rebuild would
embed.
- Result: Qwen3-1.7B INT4 weight-only loads cleanly on GPU and decodes
at ~12 tok/s. Same model + the original kernel: `clBuildProgram` aborts
with the error above; model never loads.
- No answer-quality regression observed in side-by-side comparison vs
the CPU plugin reference on a small test set of HA-state prompts.
- I have not personally re-run `ov_gpu_unit_tests
--gtest_filter='*gemv*'` against this patched master tree (I tested on
the 2025.4.1 binary). PR reviewers running the GPU unit tests will
exercise this code path.
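The same-length property behind the binary-patch approach in the first bullet can be sketched as follows. This is a hypothetical illustration in plain C, not the tool actually used: `patch_in_place` is an invented helper, the two patterns are the generic forms from the Fix section with `p` as a placeholder for each call site's real argument, and it assumes the kernel source is stored as plain text in the buffer being patched.

```c
#include <string.h>

/* Generic forms from the Fix section; `p` is a placeholder, not the
   literal argument used at any of the seven call sites. */
static const char *const ORIGINAL_CALL =
    "intel_sub_group_block_read((const __local uint*)p)";
static const char *const REPLACEMENT =
    "((const __local uint*)p)[get_sub_group_local_id()]";

/* Overwrites every occurrence of `before` in buf[0..len) with `after`.
   Returns the number of replacements, or -1 if the two strings differ
   in length (an in-place patch would then shift the following bytes). */
int patch_in_place(unsigned char *buf, size_t len,
                   const char *before, const char *after) {
    size_t n = strlen(before);
    if (n == 0 || n != strlen(after)) {
        return -1;
    }
    int count = 0;
    size_t i = 0;
    while (i + n <= len) {
        if (memcmp(buf + i, before, n) == 0) {
            memcpy(buf + i, after, n);  /* same length: nothing moves */
            ++count;
            i += n;
        } else {
            ++i;
        }
    }
    return count;
}
```

Because both patterns have the same length, the substitution leaves every other byte offset in the buffer untouched, which is what makes an in-place edit of a prebuilt binary feasible at all.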
#### AI Assistance
- AI assistance used: yes
- The patch was drafted with AI help, then human-verified by reading the
surrounding kernel context to confirm the 16×16 `[wi_id][thr_id]` layout
and the `intel_reqd_sub_group_size(16)` attribute, and tested end-to-end
with an int4 LLM decode on the affected hardware via the binary-patch
path described above.
Signed-off-by: Jordan Anderson <paul.jordan.anderson@gmail.com>