I tried running the quantization on my AMD RX 7900 XTX as described and it failed with the following output:
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 605, in <module>
quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 552, in quantize
quantized_state_dict = quant_handler.create_quantized_state_dict()
File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 416, in create_quantized_state_dict
weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
File "/data/linux_data/AI/LLM/WebUI/repositories/gpt-fast/quantize.py", line 351, in prepare_int4_weight_and_scales_and_zeros
weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/_ops.py", line 825, in __call__
return self_._op(*args, **(kwargs or {}))
RuntimeError: _convert_weight_to_int4pack_cuda is not available for build.
Edit: Seems like I'm not the only one on ROCm with this error: pytorch-labs/gpt-fast#12 (comment)
Thanks for the feedback @lufixSch. That's a bummer; it seems like every time a pure PyTorch loader appears, it doesn't really work on anything but NVIDIA.
(/home/pai/text-generation-webui/installer_files/env) pai@localhost:~/text-generation-webui> python server.py --model Llama-2-7b-chat-hf --loader "gpt-fast"
10:24:51-722859 INFO Starting Text generation web UI
10:24:51-728501 INFO Loading Llama-2-7b-chat-hf
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.84it/s]
10:26:23-967663 INFO LOADER: Transformers
10:26:23-969554 INFO TRUNCATION LENGTH: 4096
10:26:23-970050 INFO INSTRUCTION TEMPLATE: Custom (obtained from model metadata)
10:26:23-970568 INFO Loaded the model in 92.24 seconds.
10:26:23-971045 INFO Loading the extension "gallery"
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
10:26:41-825438 INFO Deleted logs/chat/Assistant/20240107-10-01-41.json.
Output generated in 284.83 seconds (1.23 tokens/s, 350 tokens, context 76, seed 1022857021)
Is it expected to be loaded as a Transformers model in the web UI? It seems to work well on my 5600G AMD APU.
No, it should appear as `LOADER: gpt-fast`. You can enforce that with `--loader "gpt-fast"`.
By the looks of it, this currently only works on NVIDIA for some reason:
Z:\ai\text-generation-webui-gpt-fast>python repositories/quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int4
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
Traceback (most recent call last):
File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 605, in <module>
quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 552, in quantize
quantized_state_dict = quant_handler.create_quantized_state_dict()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Z:\ai\text-generation-webui-gpt-fast\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "Z:\ai\text-generation-webui-gpt-fast\repositories\quantize.py", line 417, in create_quantized_state_dict
weight.to(torch.bfloat16).to('cuda'), self.groupsize, self.inner_k_tiles
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "Z:\ai\text-generation-webui-gpt-fast\installer_files\env\Lib\site-packages\torch\cuda\__init__.py", line 307, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
Yes, it does seem not to be universal after all, despite being pure interpreted PyTorch. In that case, I don't see the appeal of including and maintaining this loader.
Reference: https://github.com/pytorch-labs/gpt-fast
The advantage of this backend is that it requires no wheels or compiled extensions. It is 100% PyTorch, so it should in principle work on any GPU (including AMD, Intel Arc, and Metal).
Feedback is welcome.
Installation
It is necessary to upgrade to the latest PyTorch (Nightly) and clone the repository:
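A sketch of what that might look like, assuming a CUDA 12.1 nightly build and the `repositories/gpt-fast` layout seen in the tracebacks in this thread:

```
# Install the latest PyTorch nightly (CUDA 12.1 shown as an example).
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# Clone gpt-fast into the web UI's repositories/ folder.
git clone https://github.com/pytorch-labs/gpt-fast repositories/gpt-fast
```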
The first command depends on your hardware; the appropriate one can be found here: https://pytorch.org/get-started/locally/
Model conversion
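As a rough sketch: the Hugging Face checkpoint first has to be converted into gpt-fast's `model.pth` format (gpt-fast ships a conversion script for this), after which it can be quantized. The int4 quantization command that users in this thread ran looks like this (paths are illustrative):

```
python repositories/gpt-fast/quantize.py --checkpoint_path models/llama-2-7b-hf/model.pth --mode int4
```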
Load a model
TODO
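In the meantime, a sketch based on the invocation shown in the logs in this thread (the model name is just an example, and it assumes the checkpoint has already been converted as above):

```
python server.py --model Llama-2-7b-chat-hf --loader "gpt-fast"
```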
It is possible to use `torch.compile()` with this backend, which improves performance by a factor of around 2. I haven't added the option yet.
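For illustration, a minimal, self-contained sketch of the `torch.compile()` call itself (the toy module and the "reduce-overhead" mode are placeholders; the actual integration point in the loader may differ):

```python
import torch
import torch.nn as nn

# Toy module standing in for the gpt-fast Transformer; only the
# torch.compile() call is the point here.
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

# "reduce-overhead" enables CUDA-graph-style capture on GPU, which is
# where most of the per-token decoding speedup comes from.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 64)
with torch.no_grad():
    print(compiled_model(x).shape)  # torch.Size([1, 64])
```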