fix(server): llama v2 GPTQ (#648)

Commit

2 years ago

fix(server): llama v2 GPTQ (#648)

As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```

&

```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```
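The curl request from the commit message can also be sent from Python, which may be handier when scripting a smoke test. The following is a minimal sketch using only the standard library. The helper names `build_generate_request` and `generate` are hypothetical and not part of text-generation-inference, and the base URL assumes the launcher is listening on port 8080 as in the command above.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=256):
    """Build the same POST request as the curl command (hypothetical helper)."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, max_new_tokens=256, timeout=300):
    """Send the request and return the decoded JSON body (hypothetical helper).

    The timeout is an assumed value; a sharded 70B model can take a while.
    """
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call, assuming the server from the launcher command is running:
# print(generate("http://127.0.0.1:8080", "hey llama"))
```

Building the request separately from sending it lets the payload be inspected without a running server.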

References

#648 - fix(server): llama v2 GPTQ

Author

fxmarty

Parents

214c06f5

text-generation-inference 362883f2 - fix(server): llama v2 GPTQ (#648)
