openvino
29865046 - [GPU] Enable SDPA Fusion for 3D inputs and implicit GQA broadcasting (#36029)

Commit
6 days ago
This PR enhances the UnsqueezeBroadcastReshapeSDPAFusion pass to capture more flexible attention topologies, specifically targeting MQA and GQA patterns.

Key Enhancements:
1) 3D Input Support: The pass now intercepts and reshapes 3D Key/Value tensors feeding into Unsqueeze or Reshape nodes.
2) GQA Implicit Broadcasting: Previously, the fusion aborted if the model required expanding KV heads to match a larger number of Query heads. This PR introduces dynamic shape extraction to bypass explicit Broadcast nodes. The SDPA kernel now performs the broadcast implicitly in registers (for example, mapping 2 KV heads across 32 Query heads, a 1:16 ratio), which saves the memory bandwidth an explicit Broadcast would spend on the expanded K/V tensors.

### Tickets:
- CVS-186566, CVS-187072

### AI Assistance:
- AI assistance used: yes
- Debug and test generation
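
To illustrate the implicit GQA broadcast described above, here is a minimal pure-Python sketch, not the GPU plugin's kernel. It compares attention computed after explicitly copying each KV head for every query head in its group against attention that indexes the shared KV head directly. The function names (`kv_head_for`, `sdpa_explicit_broadcast`, `sdpa_implicit_broadcast`, `make_heads`) are hypothetical. The head mapping `q_head // group_size` is an assumption based on the usual Unsqueeze, Broadcast, Reshape expansion order.

```python
import math
import random


def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Map a query head to its shared KV head (assumed grouping: consecutive heads)."""
    group_size = num_q_heads // num_kv_heads  # e.g. 32 // 2 == 16
    return q_head // group_size


def attention_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[d] for w, v_row in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def sdpa_explicit_broadcast(q_heads, k_heads, v_heads):
    """Reference path: materialize one K/V copy per query head, then attend."""
    group_size = len(q_heads) // len(k_heads)
    k_full = [[row[:] for row in k] for k in k_heads for _ in range(group_size)]
    v_full = [[row[:] for row in v] for v in v_heads for _ in range(group_size)]
    return [attention_head(q, k, v) for q, k, v in zip(q_heads, k_full, v_full)]


def sdpa_implicit_broadcast(q_heads, k_heads, v_heads):
    """Fused-style path: no copies, each query head reads its group's KV head."""
    n_q, n_kv = len(q_heads), len(k_heads)
    return [
        attention_head(q, k_heads[kv_head_for(h, n_q, n_kv)],
                       v_heads[kv_head_for(h, n_q, n_kv)])
        for h, q in enumerate(q_heads)
    ]


def make_heads(num_heads, seq_len, head_dim, seed):
    """Deterministic random test data shaped [num_heads][seq_len][head_dim]."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(head_dim)]
             for _ in range(seq_len)] for _ in range(num_heads)]
```

With 32 query heads and 2 KV heads, the explicit path builds 32 K and 32 V copies, while the implicit path keeps only the original 2 of each and both return the same result. Query heads 0 to 15 read KV head 0 and heads 16 to 31 read KV head 1 under the assumed mapping.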