onnxruntime
a4976e33 - Add support for uint8_t as data type for GatherBlockQuantized (#24239)

### Description
This change adds support for GatherBlockQuantized to use uint8_t as the data type, with the same packing semantics as MatMulNBits. Zero points and gather axes other than 0 are not yet supported, to keep the change scoped.

### Motivation and Context
Newer models like Phi4 are trained with shared embeddings: the weights of the lm_head matrix and the embedding table are exactly the same. These embeddings are huge. Unquantized, they are 1.2GB in Phi4 mini instruct, and even at int4 quantization the weights are still 300MB. We can go a step further and have the two ops, the lm_head MatMulNBits and the embedding GatherBlockQuantized, share the same weights, saving another 300MB of model size. Two things prevent that today: the shape GatherBlockQuantized expects for its data input, and the data types it supports for data. The shape can be solved via a simple Reshape op, but the data type needs code changes, and that is what this change does.

Here is Phi4 modified with shared weights between the lm_head and GatherBlockQuantized; this model is just 2.1GB on disk.

![image](https://github.com/user-attachments/assets/8bdddbb9-5b44-4839-ab48-605bee53d66b)

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
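The weight-sharing pattern this PR enables can be sketched with the onnx Python helpers. This is a minimal, hypothetical sketch, not the PR's actual test or conversion code: the tensor names and toy shapes are invented, GatherBlockQuantized and MatMulNBits live in the com.microsoft contrib domain, and the exact attribute sets (in particular the `bits` attribute on GatherBlockQuantized) are assumptions that should be verified against the onnxruntime contrib-op documentation.

```python
# Hypothetical sketch: one packed-uint8 initializer shared between the
# embedding lookup (GatherBlockQuantized) and the lm_head (MatMulNBits),
# mirroring the Phi4 layout described in the PR. Toy sizes, random data.
import numpy as np
import onnx
from onnx import TensorProto, helper

vocab, hidden, block = 32, 64, 32                 # not Phi4's real sizes
# int4 weights packed two-per-byte, as MatMulNBits expects.
packed = np.random.randint(0, 256, size=(vocab, hidden // 2), dtype=np.uint8)
scales = np.random.rand(vocab, hidden // block).astype(np.float32)

inits = [
    helper.make_tensor("shared_qweight", TensorProto.UINT8,
                       packed.shape, packed.tobytes(), raw=True),
    helper.make_tensor("gather_scales", TensorProto.FLOAT,
                       scales.shape, scales.tobytes(), raw=True),
    # MatMulNBits takes flat scales of length N * (K / block_size); the
    # values are the same, only the shape differs.
    helper.make_tensor("matmul_scales", TensorProto.FLOAT,
                       [scales.size], scales.tobytes(), raw=True),
    # Target shape for MatMulNBits' B input: (N, K/block_size, block_size/2).
    helper.make_tensor("b_shape", TensorProto.INT64, [3],
                       np.array([vocab, hidden // block, block // 2],
                                dtype=np.int64).tobytes(), raw=True),
]

nodes = [
    # Embedding lookup over the shared packed weight. gather_axis is 0 and
    # zero points are omitted, matching the PR's stated limitations. The
    # bits attribute is an assumption based on the MatMulNBits-style
    # packing semantics the PR describes.
    helper.make_node("GatherBlockQuantized",
                     ["shared_qweight", "token_ids", "gather_scales"],
                     ["embeddings"], domain="com.microsoft",
                     gather_axis=0, quantize_axis=1,
                     block_size=block, bits=4),
    # The "simple Reshape" from the PR: adapt the 2-D shared weight to the
    # 3-D layout MatMulNBits expects for B.
    helper.make_node("Reshape", ["shared_qweight", "b_shape"], ["qweight_3d"]),
    helper.make_node("MatMulNBits",
                     ["hidden_states", "qweight_3d", "matmul_scales"],
                     ["logits"], domain="com.microsoft",
                     K=hidden, N=vocab, bits=4, block_size=block),
]

graph = helper.make_graph(
    nodes, "shared_embedding_lm_head",
    inputs=[
        helper.make_tensor_value_info("token_ids", TensorProto.INT64, ["seq"]),
        helper.make_tensor_value_info("hidden_states", TensorProto.FLOAT,
                                      ["seq", hidden]),
    ],
    outputs=[
        helper.make_tensor_value_info("embeddings", TensorProto.FLOAT,
                                      ["seq", hidden]),
        helper.make_tensor_value_info("logits", TensorProto.FLOAT,
                                      ["seq", vocab]),
    ],
    initializer=inits)

model = helper.make_model(graph, opset_imports=[
    helper.make_opsetid("", 21), helper.make_opsetid("com.microsoft", 1)])
onnx.save(model, "shared_weights_sketch.onnx")
```

Whether this graph passes the runtime's schema checks depends on the contrib-op definitions in the installed onnxruntime build; the point of the sketch is only the topology, in which both consumers read the same initializer, so the packed weight is stored once.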