[GPU] Fix LoRA adapter accuracy by enabling LoRA horizontal fusion and adjusting activation scaling (#34809)
### Details:
- **Problem**
- LoRA-adapted LLM models (e.g., Qwen3-VL-4B-Instruct) produce incorrect
results on GPU due to two FP16 precision issues:
- **FC horizontal fusion numerical divergence**: Merging the QKV FCs into
a single large FC changes the GEMM tiling/accumulation order, producing
different FP16 results. Without LoRA this is tolerable, but with LoRA
each FC output is consumed individually by `Add(FC,
MatMul(Multiply(alpha, A), B))` before concatenation, so the numerical
error propagates through the LoRA ops and amplifies across layers (see
the accumulation-order sketch below).
- **Missing activation scaling**: `activations_scale_factor` is skipped
for LLM models, but LoRA-adapted models can produce larger activation
ranges that overflow FP16 on non-IMMAD platforms (see the overflow
sketch below).
- **Solution**
- **Enable LoRA horizontal fusion on IMMAD platforms**: fusing the LoRA
`Add` as a sum post-op into the fused FC kernel keeps each branch's
correction on its own output slice, which resolves the divergence (see
the fusion sketch below).
- **Enable activation scaling for LLM+LoRA on non-IMMAD platforms** by
detecting `lora_state_` model variables (see the detection sketch
below).
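The accumulation-order divergence is easy to reproduce outside the
kernel. The accumulation-order sketch below (illustrative NumPy, not the
GPU kernel code) sums the same values under two different tilings and
gets two different FP16 results, which is the same class of divergence a
retiled fused QKV GEMM introduces:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)

# Sequential accumulation: round to FP16 after every addition.
seq = np.float16(0.0)
for v in x:
    seq = np.float16(seq + v)

# Tiled accumulation: per-tile partials, then a sum of partials --
# the same values, added in a different order, as a retiled GEMM does.
tiled = np.float16(0.0)
for tile in x.reshape(64, 64):
    partial = np.float16(0.0)
    for v in tile:
        partial = np.float16(partial + v)
    tiled = np.float16(tiled + partial)

print(seq, tiled)  # typically differs in the low bits
```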
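The overflow problem comes down to FP16's maximum finite value (65504).
The overflow sketch below, using a hypothetical scale factor value,
shows how pre-scaling keeps activations representable:
```python
import numpy as np

fp16_max = np.finfo(np.float16).max      # 65504.0

act = np.float32(1.0e5)                  # activation beyond FP16 range
print(np.float16(act))                   # inf: the direct cast overflows

scale = np.float32(256.0)                # hypothetical scale factor value
scaled = np.float16(act / scale)         # ~390.6, finite in FP16
restored = np.float32(scaled) * scale    # rescaled after FP16 compute
print(scaled, restored)
```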
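The fusion sketch below (illustrative NumPy only; the real work happens
inside the fused FC kernel) shows why applying the LoRA corrections as a
post-op on the fused output preserves the per-branch numerics of the
unfused graph:
```python
import numpy as np

def lora_fc(x, W, A, B, alpha):
    # Unfused reference: per-branch Add(FC, MatMul(Multiply(alpha, A), B)),
    # with the LoRA branch as a scaled low-rank correction.
    return x @ W + alpha * ((x @ A) @ B)

def fused_qkv_lora(x, Ws, As, Bs, alpha):
    # One wide GEMM over the concatenated Q/K/V weights, with the LoRA
    # corrections applied as a post-op, each on its own output slice, so
    # every branch keeps its own accumulation path up to its Add.
    fc_out = x @ np.concatenate(Ws, axis=1)
    deltas = [alpha * ((x @ A) @ B) for A, B in zip(As, Bs)]
    return fc_out + np.concatenate(deltas, axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 8)).astype(np.float32)
Ws = [rng.standard_normal((8, 8)).astype(np.float32) for _ in range(3)]
As = [rng.standard_normal((8, 2)).astype(np.float32) for _ in range(3)]
Bs = [rng.standard_normal((2, 8)).astype(np.float32) for _ in range(3)]

ref = np.concatenate([lora_fc(x, W, A, B, 0.5)
                      for W, A, B in zip(Ws, As, Bs)], axis=1)
print(np.allclose(ref, fused_qkv_lora(x, Ws, As, Bs, 0.5)))  # True
```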
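The detection sketch below shows the shape of the `lora_state_` check.
The helper is hypothetical: the real check lives in the plugin's C++
pipeline, and the exact Python `Variable` accessors used here are an
assumption:
```python
import openvino as ov

LORA_PREFIX = "lora_state_"  # variable-ID prefix used by LoRA adapters

def has_lora_state(model: ov.Model) -> bool:
    # Hypothetical helper mirroring the plugin-side check: scan the
    # model's ReadValue/Assign state variables for LoRA-prefixed IDs.
    # (Accessor names on Variable are an assumption here.)
    return any(var.get_info().variable_id.startswith(LORA_PREFIX)
               for var in model.get_variables())

# If the model carries LoRA state and the target lacks IMMAD support,
# the plugin enables activations_scale_factor instead of skipping it.
```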
### Tickets:
- [CVS-183147](https://jira.devtools.intel.com/browse/CVS-183147)
### AI Assistance:
- *AI assistance used: no*
---------
Signed-off-by: Andrew Park <andrew.park@intel.com>