openvino
eed5648f - [NPUW] INT-8 dynamically quantized kv-cache (#34516)

Commit

31 days ago

[NPUW] INT-8 dynamically quantized kv-cache (#34516)

### Details:
- Mixed kv-cache compression (key: u8-asym, value: u4-sym) can be enabled with the config option:
```
"KV_CACHE_PRECISION": "u8"
```
- A number of compilation issues currently exist for i4 compression, so the default scheme is (key: u8-asym, value: i8-sym).
- Performance concerns remain, but the e2e pipeline works functionally in u8 kv-cache mode, and accuracy is on par in limited-scope tests.
- An experimental mode is possible for the value cache: i4 symmetric quantization. This was verified with the RMSE metric on real kv-caches; related work: [Quantize What Counts: More For Keys, Less For Values](https://arxiv.org/abs/2502.15075v3)

### Tickets:
- C-180875

### AI Assistance:
- *AI assistance used: yes*
- *AI used to synthesize tests, and for reverse crafted decomposition v3 based on fuse-pattern*
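
For readers unfamiliar with the schemes named above, the following is a minimal, illustrative sketch of asymmetric u8 and symmetric i8/i4 quantization of a single row, together with the RMSE check mentioned in the details. It is not the NPUW implementation: the function names, the per-row granularity, and the sample values are all assumptions made for illustration.

```python
import math

def quantize_u8_asym(xs):
    """Asymmetric unsigned 8-bit quantization: x ~= scale * (q - zero_point).

    The range is widened to include 0.0 so the zero point always fits in [0, 255].
    (Hypothetical helper, not an OpenVINO API.)
    """
    lo = min(min(xs), 0.0)
    hi = max(max(xs), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero row
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize_asym(q, scale, zero_point):
    return [scale * (v - zero_point) for v in q]

def quantize_sym(xs, bits):
    """Symmetric signed quantization (bits=8 for i8, bits=4 for i4): x ~= scale * q."""
    qmax = 2 ** (bits - 1) - 1  # 127 for i8, 7 for i4
    amax = max(abs(x) for x in xs)
    scale = amax / qmax or 1.0  # fall back to 1.0 for an all-zero row
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize_sym(q, scale):
    return [scale * v for v in q]

def rmse(a, b):
    """Root-mean-square error between two equally sized sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Made-up sample row standing in for one key/value vector.
row = [-1.0, -0.25, 0.0, 0.5, 2.0]

key_q, key_scale, key_zp = quantize_u8_asym(row)
key_err = rmse(row, dequantize_asym(key_q, key_scale, key_zp))

val_q8, val_scale8 = quantize_sym(row, bits=8)
val_err8 = rmse(row, dequantize_sym(val_q8, val_scale8))

val_q4, val_scale4 = quantize_sym(row, bits=4)
val_err4 = rmse(row, dequantize_sym(val_q4, val_scale4))

print("u8-asym RMSE:", key_err)
print("i8-sym  RMSE:", val_err8)
print("i4-sym  RMSE:", val_err4)
```

Comparing the RMSE of the i4 and i8 round trips on real value-cache tensors is one way to judge whether the narrower experimental format is acceptable; the sample row here only demonstrates the mechanics.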

References

#34516 - [NPUW] INT-8 dynamically quantized kv-cache

Author

esmirno

Parents

9ac988fd
