openvino
8cb9ba0b - [GPU] Fix LoRA adapter accuracy by enabling LoRA horizontal fusion and adjusting activation scaling (#34809)

### Details:
- **Problem**: LoRA-adapted LLM models (e.g., Qwen3-VL-4B-Instruct) produce incorrect results on GPU due to two FP16 precision issues:
  - **FC horizontal fusion numerical divergence**: merging the QKV FCs into a single large FC changes the GEMM tiling and accumulation order, which produces different FP16 results (see the first sketch below). Without LoRA this is tolerable, but with LoRA each FC output is consumed individually by `Add(FC, MatMul(Multiply(alpha, A), B))` before concatenation, so the numerical error propagates through the LoRA ops and amplifies across layers.
  - **Missing activation scaling**: `activations_scale_factor` is skipped for LLM models, but LoRA-adapted models can produce larger activation ranges that overflow FP16 on non-IMMAD platforms (see the second sketch below).
- **Solution**:
  - **Enable LoRA horizontal fusion** on IMMAD platforms, which resolves the divergence by fusing the LoRA Add into the fused FC kernel as a post-op sum.
  - **Enable activation scaling for LLM+LoRA on non-IMMAD platforms** by detecting `lora_state_` model variables (see the third sketch below).

### Tickets:
- [CVS-183147](https://jira.devtools.intel.com/browse/CVS-183147)

### AI Assistance:
- *AI assistance used: no*

---------

Signed-off-by: Andrew Park <andrew.park@intel.com>
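First, a minimal NumPy sketch (an illustration of the effect, not the plugin's actual GEMM) of why merging FCs changes FP16 results: the same dot product accumulated sequentially vs. via tiled partial sums diverges in half precision, and with LoRA that per-output error feeds straight into the `Add(FC, MatMul(...))` branch instead of being absorbed by the concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One input row and one weight column of an attention FC, in FP16.
x = rng.standard_normal(4096).astype(np.float16)
w = rng.standard_normal(4096).astype(np.float16)
prod = x * w  # elementwise FP16 products

# Accumulation order 1: strictly sequential FP16 sum.
seq = np.float16(0.0)
for v in prod:
    seq = np.float16(seq + v)

# Accumulation order 2: eight FP16 partial sums combined afterwards,
# a stand-in for the different tiling of the merged QKV GEMM.
partials = [c.sum(dtype=np.float16) for c in prod.reshape(8, -1)]
tiled = np.float16(sum(partials))

print(seq, tiled, seq == tiled)  # the two FP16 results typically differ
```

Second, a sketch of the FP16 overflow that activation scaling guards against (the values are synthetic, chosen to overflow, not taken from any model): the running sum of an FC inner product exceeds the FP16 max (65504) even though the final value fits, so shrinking activations before the FC and rescaling the output keeps every intermediate in range.

```python
import numpy as np

np.seterr(over="ignore")  # silence the expected FP16 overflow warning

def dot_fp16(x, w):
    # Multiply-accumulate entirely in FP16, mimicking a half-precision GEMM.
    acc = np.float16(0.0)
    for v in x * w:
        acc = np.float16(acc + v)
    return acc

# Synthetic activations whose running sum peaks above the FP16 max (65504)
# even though the true result (62080) fits; LoRA-widened activation ranges
# can produce this shape on non-IMMAD platforms.
x = np.concatenate([np.full(1800, 8.0), np.full(248, -8.0)]).astype(np.float16)
w = np.full(2048, 5.0, dtype=np.float16)

print(dot_fp16(x, w))  # inf: the intermediate sum overflows FP16

# activations_scale_factor in spirit: shrink activations before the FC,
# undo the scale on the output; intermediates now stay inside FP16 range.
s = np.float16(4.0)
print(np.float16(dot_fp16(x / s, w) * s))  # finite (FP16 rounding makes it approximate)
```

Third, a hypothetical Python sketch of the `lora_state_` detection; the actual check lives in the GPU plugin's C++, and `Node.get_attributes()` surfacing `variable_id` is an assumption of this sketch. LoRA adapters appear as stateful ReadValue ops whose variable ids carry the `lora_state_` prefix.

```python
import openvino as ov

def has_lora_state(model: ov.Model) -> bool:
    # LoRA weights are fed through stateful ReadValue ops whose variable
    # ids start with "lora_state_"; finding one marks the model as
    # LLM+LoRA, so activation scaling is kept on for non-IMMAD GPUs.
    for op in model.get_ordered_ops():
        if op.get_type_name() == "ReadValue":
            # Assumption: get_attributes() exposes the op's variable_id.
            var_id = str(op.get_attributes().get("variable_id", ""))
            if var_id.startswith("lora_state_"):
                return True
    return False
```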
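A usage note on the first two sketches: both isolate the accumulation behavior in pure NumPy so the effect is reproducible without a GPU; on real hardware the divergence and overflow depend on the kernel's actual tile sizes and the model's activation statistics.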
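The detection sketch keys on ReadValue because stateful LoRA adapters stream their A/B weights through model state; the `has_lora_state` name and the attribute-lookup path are illustrative only.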