[WIP] Inference support for GPTQ (llama at least)

Commit

2 years ago

Let's start discussing implementation.

- Need to expose the quantization scripts (either include them here or add docs on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa).
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently this means checking for quantization in every place we use `get_{tensor|sharded}`.

My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the loading of `qweight`, `qzeros`, `scales`, and `g_idx` should live in a single place and be independent of whether a bias is present.
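To make the "single place" idea concrete, here is a minimal sketch, not the actual `load_multi` code. It assumes a checkpoint exposed as a plain dict keyed by `<prefix>.<name>`; the `GPTQWeight` container and the `load_linear` helper are hypothetical names introduced only for illustration, and plain strings stand in for tensors.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

# Tensor names taken from the commit message.
GPTQ_KEYS = ("qweight", "qzeros", "scales", "g_idx")


@dataclass
class GPTQWeight:
    """Hypothetical container grouping the four GPTQ tensors of one layer."""
    qweight: Any
    qzeros: Any
    scales: Any
    g_idx: Any


def load_linear(tensors: Dict[str, Any], prefix: str) -> Tuple[Any, Optional[Any]]:
    """Return (weight, bias) for one layer.

    Assumption: a layer is quantized if `<prefix>.qweight` exists.
    The bias lookup is identical on both paths, so the quantization
    check does not depend on whether a bias is present.
    """
    if f"{prefix}.qweight" in tensors:
        weight = GPTQWeight(**{k: tensors[f"{prefix}.{k}"] for k in GPTQ_KEYS})
    else:
        weight = tensors[f"{prefix}.weight"]
    bias = tensors.get(f"{prefix}.bias")  # None when the layer has no bias
    return weight, bias


# Toy checkpoint: one quantized layer without bias, one dense layer with bias.
checkpoint = {
    "q_proj.qweight": "QW",
    "q_proj.qzeros": "QZ",
    "q_proj.scales": "S",
    "q_proj.g_idx": "G",
    "o_proj.weight": "W",
    "o_proj.bias": "B",
}
q_weight, q_bias = load_linear(checkpoint, "q_proj")
o_weight, o_bias = load_linear(checkpoint, "o_proj")
```

With a helper shaped like this, model code would call one function per layer instead of branching on quantization at each `get_{tensor|sharded}` call site. Sharding is left out of the sketch entirely.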

References

#438 - Inference support for GPTQ (llama + falcon tested) + Quantization script

Author

Ubuntu

Committer

Narsil

Parents

5ce89059

text-generation-inference 9a12941b - [WIP] Inference support for GPTQ (llama at least)
