[GPU] Fix Gemma4-E4B SDPA model (#35642)
### Details:
This PR fixes two errors that surface one after the other when running
inference of the SDPA version of the Gemma4-E4B model on iGPU:
1. In this model, a single `KVCache` is shared between multiple `SDPA`
layers.
This causes problems in `SyncInferRequest::allocate_states`: a
`VariableStateIndirectKVCacheCompressed` is created when one
`kv_cache_prim` is compressed and the other is not, which leads to
output mismatches in the plugin.
Currently this only affects iGPUs with `supports_immad=false`, so as a
temporary fix `KVCacheCompression` is disabled for such graphs, which
removes the potential mismatches.
This will need to be revisited when PA models are enabled.
2. Out-of-bounds SLM access in the `sdpa_opt` finalization kernel.
The finalization stage allocated `tmp_slm[SUBGROUP_SIZE]` elements for
cross-subgroup reduction, but the actual number of subgroups per
workgroup is `SUBGROUPS_PER_WG = CEIL_DIV(V_HEAD_SIZE * SG_SCALE_FACTOR,
SUBGROUP_SIZE)`.
When `V_HEAD_SIZE * SG_SCALE_FACTOR` > `SUBGROUP_SIZE^2` (e.g.
`head_size=512`, `SUBGROUP_SIZE=16` gives `SUBGROUPS_PER_WG`=32 > 16),
`tmp_slm[sgid]` writes go out of bounds.
The fix changes the `tmp_slm` allocation to `SUBGROUPS_PER_WG` elements
and replaces the single-pass lane-indexed reduction with a folded loop
over `CEIL_DIV(SUBGROUPS_PER_WG, SUBGROUP_SIZE)` iterations, which
reduces correctly across all subgroups regardless of head size.
The issue is hard to reproduce with a standalone test; apparently more
memory interactions are needed to trigger it, as happens during
inference of the whole model.
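
For reviewers, the arithmetic and the folded reduction from item 2 can be
modelled on the host. This is only an illustrative sketch, not the kernel
code from this PR: the helper names `ceil_div`, `subgroups_per_wg` and
`folded_reduce_sum` are hypothetical, the reduction is assumed to be a sum,
and the per-lane accumulation stands in for a subgroup reduce on the device.

```cpp
#include <cstddef>
#include <vector>

// Rounding-up division, mirroring the CEIL_DIV macro named in the description.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// SUBGROUPS_PER_WG = CEIL_DIV(V_HEAD_SIZE * SG_SCALE_FACTOR, SUBGROUP_SIZE)
constexpr std::size_t subgroups_per_wg(std::size_t v_head_size,
                                       std::size_t sg_scale_factor,
                                       std::size_t subgroup_size) {
    return ceil_div(v_head_size * sg_scale_factor, subgroup_size);
}

// Host-side model of the folded reduction (hypothetical helper).
// tmp_slm holds one partial value per subgroup, i.e. SUBGROUPS_PER_WG entries.
// Each "lane" accumulates the entries at lane, lane + subgroup_size, ...
// so every entry is visited even when there are more subgroups than lanes.
float folded_reduce_sum(const std::vector<float>& tmp_slm,
                        std::size_t subgroup_size) {
    const std::size_t iterations = ceil_div(tmp_slm.size(), subgroup_size);
    std::vector<float> lane_acc(subgroup_size, 0.0f);
    for (std::size_t i = 0; i < iterations; ++i) {
        for (std::size_t lane = 0; lane < subgroup_size; ++lane) {
            const std::size_t idx = i * subgroup_size + lane;
            if (idx < tmp_slm.size()) {  // guard the last, partial fold
                lane_acc[lane] += tmp_slm[idx];
            }
        }
    }
    // Combine the per-lane accumulators (a subgroup reduce on the device).
    float total = 0.0f;
    for (float v : lane_acc) {
        total += v;
    }
    return total;
}
```

With `head_size=512`, `SG_SCALE_FACTOR=1` and `SUBGROUP_SIZE=16`, the model
needs 32 slots and two fold iterations, whereas a buffer of 16 slots indexed
by `sgid` would be overrun by subgroups 16..31.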
### AI Assistance:
- *AI assistance used: yes*