openvino
9a25caa5 - [GPU] Fix Gemma4-E4B SDPA model (#35642)

Commit

17 days ago

[GPU] Fix Gemma4-E4B SDPA model (#35642)

### Details:

This PR fixes two consecutive errors in inference of the SDPA version of the Gemma4-E4B model on iGPU.

1. In this model, a single `KVCache` is shared between multiple `SDPA` layers. This causes problems in `SyncInferRequest::allocate_states`: `VariableStateIndirectKVCacheCompressed` is created when one `kv_cache_prim` is compressed and the other is not, which leads to output mismatches in the plugin. Currently this only affects iGPUs with `supports_immad=false`, so it is fixed temporarily by disabling `KVCacheCompression` for such a graph, which removes the potential mismatches. This will need to be redone when PA models are enabled.

2. Out-of-bounds SLM access in the `sdpa_opt` finalization kernel. The finalization stage allocated `tmp_slm[SUBGROUP_SIZE]` elements for the cross-subgroup reduction, but the actual number of subgroups per workgroup is `SUBGROUPS_PER_WG = CEIL_DIV(V_HEAD_SIZE * SG_SCALE_FACTOR, SUBGROUP_SIZE)`. When `V_HEAD_SIZE` > `SUBGROUP_SIZE^2` (e.g. `head_size=512` with `SUBGROUP_SIZE=16` gives `SUBGROUPS_PER_WG=32`, which is greater than 16), the `tmp_slm[sgid]` writes go out of bounds. The `tmp_slm` allocation is therefore changed to `SUBGROUPS_PER_WG`, and the single-pass lane-indexed reduction is replaced with a folded loop over `CEIL_DIV(SUBGROUPS_PER_WG, SUBGROUP_SIZE)` iterations, which reduces correctly across all subgroups regardless of head size.

The issue is hard to reproduce with a test; apparently it takes more interaction with memory, as in inference of the whole model, to trigger it.

### AI Assistance:

- *AI assistance used: yes*
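The sizing arithmetic and the folded reduction described in item 2 can be illustrated with a small host-side model. This is a minimal Python sketch, not the kernel code: the function names are hypothetical, `SG_SCALE_FACTOR` is assumed to be 1 (which matches the `head_size=512` example), and a plain sum stands in for whatever reduction the finalization stage actually performs. On a GPU an out-of-bounds SLM write corrupts memory silently; the sketch raises `IndexError` only to make the problem visible.

```python
def ceil_div(a, b):
    """Integer ceiling division, mirroring the CEIL_DIV macro."""
    return (a + b - 1) // b


def subgroups_per_wg(v_head_size, subgroup_size, sg_scale_factor=1):
    """SUBGROUPS_PER_WG as given in the commit message.

    sg_scale_factor=1 is an assumption made for this sketch.
    """
    return ceil_div(v_head_size * sg_scale_factor, subgroup_size)


def single_pass_reduce(partials, subgroup_size):
    """Model of the old scheme: SLM sized by SUBGROUP_SIZE, one lane per slot."""
    tmp_slm = [0.0] * subgroup_size
    for sgid, value in enumerate(partials):
        if sgid >= len(tmp_slm):
            # On real hardware this write would silently land out of bounds.
            raise IndexError(f"tmp_slm[{sgid}] is out of bounds")
        tmp_slm[sgid] = value
    # Each lane reads exactly one slot, then the lanes are reduced.
    return sum(tmp_slm[lane] for lane in range(subgroup_size))


def folded_reduce(partials, subgroup_size):
    """Model of the new scheme: SLM sized by SUBGROUPS_PER_WG, folded loop."""
    num_subgroups = len(partials)
    tmp_slm = list(partials)  # one slot per subgroup, so no write is out of bounds
    lane_acc = [0.0] * subgroup_size
    for i in range(ceil_div(num_subgroups, subgroup_size)):
        for lane in range(subgroup_size):
            idx = i * subgroup_size + lane
            if idx < num_subgroups:  # guard the tail of the last iteration
                lane_acc[lane] += tmp_slm[idx]
    # Final reduction across the lanes of one subgroup.
    return sum(lane_acc)


if __name__ == "__main__":
    n = subgroups_per_wg(v_head_size=512, subgroup_size=16)
    partials = [float(i) for i in range(n)]
    print(n, folded_reduce(partials, 16))
```

With `head_size=512` and `SUBGROUP_SIZE=16` the model yields 32 subgroups, so the folded loop runs two iterations, while the single-pass version fails on the write for `sgid=16`. For head sizes where `SUBGROUPS_PER_WG` does not exceed `SUBGROUP_SIZE`, both versions produce the same result.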

References

#35642 - [GPU] Fix Gemma4-E4B SDPA model

Author

Lyamin-Roman

Parents

ef1db002
