Fix GQA Parity (#27108)
Fixes [#27079](https://github.com/microsoft/onnxruntime/issues/27079):
Qwen3 model quality regression on the CUDA backend.
### Root Cause Analysis
The parity issue was caused by **buffer pointer misconfiguration** in
the GQA (Group Query Attention) QKV preprocessing pipeline. The original
implementation used multiple separate kernels for:
1. Unpacking the packed QKV tensor
2. Applying RoPE (Rotary Position Embedding) to Q and K
3. Appending K/V to the KV cache
This multi-kernel approach created opportunities for misconfiguration
(sketched schematically below):
- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or initialized
- Buffer-sharing logic was fragmented across different code paths
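For illustration only, the snippet below is a purely schematic rendering of such a multi-kernel flow; it is not the original ONNX Runtime code, and every name in it is hypothetical. The point it shows is that each launch takes raw device pointers, so the stage-to-stage wiring lives in host code and is easy to get subtly wrong.

```cpp
// Purely schematic, not the original code: three hypothetical stages, each taking
// and producing raw device pointers, with the stage-to-stage wiring done on the host.
#include <cuda_runtime.h>

__global__ void UnpackQKVSketch(const float* packed_qkv, float* q, float* k, float* v) { /* ... */ }
__global__ void ApplyRoPESketch(float* q, float* k) { /* ... */ }
__global__ void AppendKVSketch(const float* k, const float* v, float* present_k, float* present_v) { /* ... */ }

void MultiKernelQKVSketch(const float* packed_qkv,
                          float* q_buf, float* k_buf, float* v_buf,
                          float* present_k, float* present_v,
                          cudaStream_t stream) {
  UnpackQKVSketch<<<1, 128, 0, stream>>>(packed_qkv, q_buf, k_buf, v_buf);
  ApplyRoPESketch<<<1, 128, 0, stream>>>(q_buf, k_buf);
  // Hazard: the append stage must receive exactly the buffers the earlier stages wrote.
  // If one code path passes an unwritten or unallocated buffer here, the KV cache is
  // filled from garbage and model output quality silently degrades.
  AppendKVSketch<<<1, 128, 0, stream>>>(k_buf, v_buf, present_k, present_v);
}
```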
### Solution
Consolidate QKV preprocessing into a **single fused kernel**
(`UnpackRoPEAppend`) that performs all operations in one pass:
1. **Unified kernel design**: A single kernel handles unpacking, RoPE
application, and cache append operations
2. **Simplified buffer management**: The new `PrepareQKV` function
centralizes buffer allocation and ensures proper initialization (a
simplified host-side sketch follows this list)
3. **Explicit past-to-present cache copy**: When
`past_present_share_buffer` is false, explicitly copy past KV cache to
present buffer before appending new tokens
4. **Zero-initialization for non-shared buffers**: Clear present KV
buffers when not sharing with past to ensure deterministic output
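As a rough sketch of the idea (hypothetical names and shapes, fp32 instead of fp16/bf16, a BNSH cache layout, and `present_len` standing in for the present-buffer sequence capacity; the real `PrepareQKV` and `UnpackRoPEAppend` in this PR differ in signature and detail), the host-side flow looks roughly like this: zero-initialize the present KV buffers when they are not shared with the past, copy the past cache into them, then make a single fused launch.

```cpp
// Illustrative sketch only: hypothetical names and shapes, fp32 instead of fp16/bf16,
// BNSH cache layout [batch, kv_heads, seq, head_size]. Not the actual PrepareQKV.
#include <cuda_runtime.h>
#include <cstddef>

// Stub for a fused unpack + RoPE + append kernel; the real kernel body is elided.
__global__ void UnpackRoPEAppendSketch(const float* packed_qkv, float* q_out,
                                       float* present_k, float* present_v,
                                       int batch, int new_tokens, int past_len,
                                       int num_heads, int kv_heads, int head_size) {
  // One pass: read packed QKV, rotate Q/K, write K/V at offset past_len in the cache.
}

cudaError_t PrepareQKVSketch(const float* packed_qkv,
                             const float* past_k, const float* past_v,
                             float* q_out, float* present_k, float* present_v,
                             bool past_present_share_buffer,
                             int batch, int new_tokens, int past_len, int present_len,
                             int num_heads, int kv_heads, int head_size,
                             cudaStream_t stream) {
  const size_t row_bytes = static_cast<size_t>(head_size) * sizeof(float);
  const size_t present_bytes =
      static_cast<size_t>(batch) * kv_heads * present_len * row_bytes;

  if (!past_present_share_buffer) {
    // Zero-initialize non-shared present buffers so untouched slots are deterministic.
    cudaMemsetAsync(present_k, 0, present_bytes, stream);
    cudaMemsetAsync(present_v, 0, present_bytes, stream);
    // Explicitly copy the past KV cache into the present buffers before appending:
    // one strided copy per tensor, because past uses a past_len stride while present
    // uses a present_len stride per (batch, head) slice.
    if (past_len > 0) {
      cudaMemcpy2DAsync(present_k, present_len * row_bytes, past_k, past_len * row_bytes,
                        past_len * row_bytes, static_cast<size_t>(batch) * kv_heads,
                        cudaMemcpyDeviceToDevice, stream);
      cudaMemcpy2DAsync(present_v, present_len * row_bytes, past_v, past_len * row_bytes,
                        past_len * row_bytes, static_cast<size_t>(batch) * kv_heads,
                        cudaMemcpyDeviceToDevice, stream);
    }
  }

  // A single fused launch replaces the previous unpack / RoPE / append sequence.
  dim3 grid(batch * num_heads, new_tokens);
  dim3 block(head_size);
  UnpackRoPEAppendSketch<<<grid, block, 0, stream>>>(
      packed_qkv, q_out, present_k, present_v,
      batch, new_tokens, past_len, num_heads, kv_heads, head_size);
  return cudaGetLastError();
}
```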
### Changes Summary
| File | Changes |
|------|---------|
| `group_query_attention_qkv.cuh` | New fused `UnpackRoPEAppend` kernel with shared memory optimization for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating QKV prep to the unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations |
### Key Improvements
- **Reduced kernel launches**: From 4-5 separate kernel calls to a
single fused kernel
- **Better memory safety**: All buffer pointers are validated in a
single location
- **Improved RoPE handling**: Uses shared memory for efficient
non-interleaved RoPE computation (a minimal sketch follows this list)
- **Deterministic output**: Explicit buffer initialization ensures
consistent results across runs
- **Compatible with quantized KV cache**: The new preprocessing kernel
design supports future quantization work
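To make the shared-memory point concrete: in the non-interleaved (half-split) RoPE layout, element `i` is rotated against its partner at `i ± rotary_dim/2`, so staging the whole head vector in shared memory gives every thread cheap access to its partner after one synchronization. The sketch below illustrates only the technique; it is not the kernel in `group_query_attention_qkv.cuh`, and the names, layout, and launch shape are assumptions.

```cpp
// Minimal sketch of non-interleaved RoPE with a shared-memory staging buffer.
// Hypothetical names and layout; fp32 is used for brevity.
#include <cuda_runtime.h>

// One block per (token, head); blockDim.x == head_size; dynamic shared memory of
// head_size floats.
__global__ void RotaryNonInterleavedSketch(float* x,                // [tokens, heads, head_size], in place
                                           const float* cos_cache,  // [positions, rotary_dim / 2]
                                           const float* sin_cache,  // [positions, rotary_dim / 2]
                                           const int* positions,    // [tokens]
                                           int heads, int head_size, int rotary_dim) {
  extern __shared__ float vec[];  // staged copy of one head vector

  const int token = blockIdx.x;
  const int head = blockIdx.y;
  const int i = threadIdx.x;
  float* head_ptr = x + (static_cast<size_t>(token) * heads + head) * head_size;

  // Stage the head into shared memory so each thread can read its rotation partner,
  // which lives rotary_dim/2 elements away in the non-interleaved layout.
  vec[i] = head_ptr[i];
  __syncthreads();

  if (i < rotary_dim) {
    const int half = rotary_dim / 2;
    const int pair = (i < half) ? i : i - half;  // cos/sin index shared by the pair
    const float c = cos_cache[positions[token] * half + pair];
    const float s = sin_cache[positions[token] * half + pair];
    const float partner = (i < half) ? vec[i + half] : vec[i - half];
    // First half: x*cos - partner*sin; second half: x*cos + partner*sin.
    head_ptr[i] = (i < half) ? vec[i] * c - partner * s
                             : vec[i] * c + partner * s;
  }
  // Elements beyond rotary_dim, if any, are left untouched.
}
```

A launch of the form `RotaryNonInterleavedSketch<<<dim3(tokens, heads), head_size, head_size * sizeof(float), stream>>>(...)` would rotate every head of every token in one call. Besides avoiding a second global-memory read of the partner element, the staging copy also sidesteps the read-after-write hazard of rotating the vector in place.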
### Testing
- All existing GQA unit tests pass
- Verified that the Qwen3 model no longer produces gibberish output
- Tested with both fp16 and bf16, across various head configurations