openvino
5071f092 - [GPU] Fix fully_connected_gpu_gemv compilation on modern NEO drivers (Gen 9.5+) (#35661)

Commit
23 days ago
[GPU] Fix fully_connected_gpu_gemv compilation on modern NEO drivers (Gen 9.5+) (#35661)

### Details

OpenCL compilation of `fully_connected_gpu_gemv.cl` fails on Intel Compute Runtime (NEO) **23.x and newer** with:

```
error: no matching function for call to 'intel_sub_group_block_read'
```

Seven call sites in the kernel pass `__local uint *` to `intel_sub_group_block_read`. Per the [`cl_intel_subgroups` extension spec](https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroups.html), this builtin is only defined for `__global` pointers; an unofficial `__local` overload was carried in older NEO releases and was later dropped, since it was never part of any extension contract.

#### Symptom

- Hardware: any Intel Gen ≥ 9.5 GPU (UHD 630 / UHD P630 / Iris Xe / Arc).
- Driver: confirmed broken on Intel Compute Runtime (NEO) **23.43.27642.18**, **24.05.28454.6**, and tip of [intel/compute-runtime](https://github.com/intel/compute-runtime).
- Models impacted: any model that hits the int4 weight-only FC path (Qwen3, Llama, Mistral, Phi, etc., when run via the GPU plugin).
- Failure mode: `LOAD_NETWORK` aborts with the OpenCL-compiler error above; the model never loads.
- The 2024.6 release uses the working `__global` block-read path and is unaffected. This is a regression introduced during the 2025 dev cycle in the int4 fc gemv rewrite.

#### Root cause

`intel_sub_group_block_read(const __global uint*)` is the only form defined by the `cl_intel_subgroups` extension. Older NEO releases shipped an additional `__local`-pointer overload in their built-in headers; that overload was removed because it was never part of any contract.
The seven affected lines in `fully_connected_gpu_gemv.cl` were written against that unofficial overload:

| line | kernel layout | array |
| --- | --- | --- |
| 226 | `OS_IS_YX_OSV16` | `all_sum_even` |
| 370 | `OS_IS_YX_OSV32_ISV2` | `all_sum_even` |
| 371 | `OS_IS_YX_OSV32_ISV2` | `all_sum_odd` |
| 566 | `OS_IS_YX_OSV64_ISV2` | `all_sum_0` |
| 567 | `OS_IS_YX_OSV64_ISV2` | `all_sum_1` |
| 568 | `OS_IS_YX_OSV64_ISV2` | `all_sum_2` |
| 569 | `OS_IS_YX_OSV64_ISV2` | `all_sum_3` |

#### Fix

Each `intel_sub_group_block_read((const __local uint*)p)` call is replaced with `((const __local uint*)p)[get_sub_group_local_id()]`. This is semantically equivalent because:

1. The kernel forces a 16-wide sub-group via `__attribute__((intel_reqd_sub_group_size(SUBGROUP_SIZE)))` (line 98), with `SUBGROUP_SIZE == 16`.
2. The source arrays are declared `__local float[16][16]` and indexed `[wi_id][thr_id]`, where `wi_id == get_sub_group_local_id()`.
3. `intel_sub_group_block_read` on a 16-wide sub-group reads 16 consecutive uints from the pointer and distributes them across the 16 lanes, so lane `i` receives word `i`. The per-lane scalar load `p[get_sub_group_local_id()]` produces the same per-lane value.
4. The `__local` arrays are written under `barrier(CLK_LOCAL_MEM_FENCE)` already (lines 223, 367, 563), so the read is correctly ordered.
5. The result is fed into `sub_group_reduce_add` immediately afterwards; that consumer is unchanged.

The replacement compiles on every NEO version (no extension required, just a standard `__local` pointer dereference indexed by `get_sub_group_local_id()`).

#### Tickets

No prior issue or PR found mentioning `intel_sub_group_block_read` in this repo as of master @ 2026-05-04. Six related fc-gemv PRs (#28976, #31486, #31806, #31477, #32710 / #32735, #32749) touch the same kernel for unrelated reasons and don't address the `__local` block-read issue.

#### Testing

- Tested on Intel UHD P630 (Gen 9.5) + NEO 23.43.27642.18 + IGC 1.0.15468.11.
  The replacement strings are byte-for-byte the same length as the originals, so I could verify the patch by binary-string-replacing them inline into the prebuilt `libopenvino_intel_gpu_plugin.so` from `openvino==2025.4.1`, which is bit-equivalent to a clean source rebuild.
- Result: Qwen3-1.7B INT4 weight-only loads cleanly on GPU and decodes at ~12 tok/s. Same model + the original kernel: `clBuildProgram` aborts with the error above; model never loads.
- No answer-quality regression observed in side-by-side comparison vs the CPU plugin reference on a small test set of HA-state prompts.
- I have not personally re-run `ov_gpu_unit_tests --gtest_filter='*gemv*'` against this patched master tree (I tested on the 2025.4.1 binary). PR reviewers running the GPU unit tests will exercise this code path.

#### AI Assistance

- AI assistance used: yes
- The patch was drafted with AI help, then human-verified by reading the surrounding kernel context to confirm the 16×16 `[wi_id][thr_id]` layout and the `intel_reqd_sub_group_size(16)` attribute, and tested end-to-end with an int4 LLM decode on the affected hardware via the binary-patch path described above.

Signed-off-by: Jordan Anderson <paul.jordan.anderson@gmail.com>
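
The equivalence argued under Fix (points 1 to 3) can be sanity-checked off-device with a small host-side C model. This is only a sketch and not the kernel source: `model_block_read`, `model_indexed_load`, `loads_agree`, and the test data are hypothetical names and values introduced for illustration, and the model assumes the block-read semantics stated in point 3 (lane `i` receives word `i`).

```c
#include <stdint.h>
#include <string.h>

#define SUBGROUP_SIZE 16 /* sub-group width stated in the commit message */

/* Hypothetical stand-in for a block read on a 16-wide sub-group:
 * copy 16 consecutive 32-bit words starting at p, so that lanes[i]
 * ends up holding word i. */
static void model_block_read(const uint32_t *p, uint32_t lanes[SUBGROUP_SIZE]) {
    memcpy(lanes, p, SUBGROUP_SIZE * sizeof(uint32_t));
}

/* Hypothetical stand-in for the replacement: each lane performs its own
 * scalar load p[lane_id], where lane_id models get_sub_group_local_id(). */
static void model_indexed_load(const uint32_t *p, uint32_t lanes[SUBGROUP_SIZE]) {
    for (uint32_t lane_id = 0; lane_id < SUBGROUP_SIZE; ++lane_id) {
        lanes[lane_id] = p[lane_id];
    }
}

/* Fill a 16x16 buffer with the bit patterns of arbitrary floats (the
 * kernel's arrays hold floats that are read back as uints), then compare
 * both load models row by row. Returns 1 if every lane agrees, else 0. */
static int loads_agree(void) {
    uint32_t local_mem[SUBGROUP_SIZE][SUBGROUP_SIZE];
    for (int row = 0; row < SUBGROUP_SIZE; ++row) {
        for (int col = 0; col < SUBGROUP_SIZE; ++col) {
            float v = (float)row * 0.5f + (float)col; /* arbitrary test data */
            memcpy(&local_mem[row][col], &v, sizeof v);
        }
    }
    for (int row = 0; row < SUBGROUP_SIZE; ++row) {
        uint32_t via_block[SUBGROUP_SIZE];
        uint32_t via_index[SUBGROUP_SIZE];
        model_block_read(local_mem[row], via_block);
        model_indexed_load(local_mem[row], via_index);
        if (memcmp(via_block, via_index, sizeof via_block) != 0) {
            return 0;
        }
    }
    return 1;
}
```

Calling `loads_agree()` from a small `main` or a unit test should return 1. The model checks only the per-lane data mapping; it says nothing about barriers, performance, or compiler behavior on real hardware, which the on-device testing above covers.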