openvino
083dcb63 - [GPU] Add Adaptive RKV Diversity Calculation Support to PagedAttention (#33568)

Committed 130 days ago
### Details:
- Add Adaptive RKV diversity calculation to PagedAttention, enabling dynamic KV cache eviction based on token similarity analysis
- Implement a new OpenCL kernel (`pa_adaptive_rkv_diversity_ref.cl`)
  - Compute diversity scores using cosine similarity between key vectors
  - Load and normalize key vectors with optional dequantization
  - Compute cosine similarities (upper triangle only, since the matrix is symmetric)
  - Apply row-wise mean thresholding across attention heads
  - Calculate negative block sums for eviction scoring
  - Support compressed (BY_CHANNEL, BY_TOKEN) and uncompressed KV cache
- Add an `AdaptiveRKVDIversityGenerator` stage to `paged_attention_opt.cpp`
  - Conditional execution based on the runtime `evictable_sizes` input
  - Dynamic allocation of 4 intermediate buffers for the diversity calculation
- Add unit tests in `paged_attention_gpu_test.cpp`
  - Reference implementation for validation

### Tickets:
- [CVS-173522](https://jira.devtools.intel.com/browse/CVS-173522)

---------

Signed-off-by: Andrew Park <andrew.park@intel.com>
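The diversity steps listed in the commit message (normalize keys, upper-triangle cosine similarity, row-wise mean thresholding, negative block sums) can be illustrated with a minimal pure-Python sketch. It is not the OpenCL kernel and not an OpenVINO API: the function names, the `[num_heads][num_tokens][head_dim]` input layout, zeroing the diagonal, averaging the thresholded matrices over heads, and reducing each block to a single score are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def l2_normalize(v: Vector, eps: float = 1e-12) -> Vector:
    """Scale a key vector to unit length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def cosine_similarity_matrix(keys: List[Vector]) -> Matrix:
    """Pairwise cosine similarity; only the upper triangle is computed, then mirrored."""
    unit = [l2_normalize(k) for k in keys]
    n = len(unit)
    sim = [[0.0] * n for _ in range(n)]  # diagonal (self-similarity) left at 0 by assumption
    for i in range(n):
        for j in range(i + 1, n):
            s = sum(a * b for a, b in zip(unit[i], unit[j]))
            sim[i][j] = sim[j][i] = s
    return sim


def threshold_by_row_mean(sim: Matrix) -> Matrix:
    """Zero every entry that falls below the mean of its row."""
    out = []
    for row in sim:
        mean = sum(row) / len(row)
        out.append([x if x >= mean else 0.0 for x in row])
    return out


def block_diversity_scores(keys_per_head: List[List[Vector]], block_size: int) -> List[float]:
    """Hypothetical end-to-end sketch: one diversity score per KV cache block.

    keys_per_head has shape [num_heads][num_tokens][head_dim].
    """
    num_heads = len(keys_per_head)
    num_tokens = len(keys_per_head[0])
    # Average the per-head thresholded similarity matrices across attention heads.
    avg = [[0.0] * num_tokens for _ in range(num_tokens)]
    for head_keys in keys_per_head:
        thr = threshold_by_row_mean(cosine_similarity_matrix(head_keys))
        for i in range(num_tokens):
            for j in range(num_tokens):
                avg[i][j] += thr[i][j] / num_heads
    # Per-token redundancy is the row sum; a block's score is the negated sum
    # over its tokens, so redundant blocks get lower (more evictable) scores.
    redundancy = [sum(row) for row in avg]
    return [-sum(redundancy[b:b + block_size]) for b in range(0, num_tokens, block_size)]


# Usage: 1 head, 4 tokens, head_dim 2; tokens 0 and 1 point in the same direction.
keys = [[[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]]
scores = block_diversity_scores(keys, block_size=2)
```

In this toy input the first block holds two near-duplicate keys, so it receives the lower score and would be the eviction candidate under this sketch; the second block's dissimilar keys leave it at zero.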
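The commit message notes that keys may be dequantized before normalization and that both BY_CHANNEL and BY_TOKEN compressed caches are supported. A minimal sketch of the difference between the two modes follows, assuming a common asymmetric scheme `x = (q - zero_point) * scale`; the function name, the argument layout, and the exact formula are assumptions, not taken from the kernel.

```python
from typing import List

Matrix = List[List[float]]


def dequantize_keys(q: List[List[int]], scale: List[float],
                    zero_point: List[float], mode: str) -> Matrix:
    """Hypothetical asymmetric dequantization of a [num_tokens][head_dim] key tile.

    BY_TOKEN:   one (scale, zero_point) pair per token   -> len(scale) == num_tokens
    BY_CHANNEL: one (scale, zero_point) pair per channel -> len(scale) == head_dim
    """
    if mode not in ("BY_TOKEN", "BY_CHANNEL"):
        raise ValueError(f"unsupported mode: {mode}")
    out = []
    for t, row in enumerate(q):
        deq = []
        for c, v in enumerate(row):
            # Pick the quantization parameters by token index or by channel index.
            idx = t if mode == "BY_TOKEN" else c
            deq.append((v - zero_point[idx]) * scale[idx])
        out.append(deq)
    return out


q = [[10, 20], [30, 40]]
by_token = dequantize_keys(q, [0.5, 2.0], [10.0, 0.0], "BY_TOKEN")
by_channel = dequantize_keys(q, [0.5, 2.0], [10.0, 0.0], "BY_CHANNEL")
```

The same quantized tile and parameter arrays produce different keys depending on whether the parameters are indexed by token or by channel, which is why a kernel handling both layouts has to branch on the mode when loading keys.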