llama.cpp

ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151

Merged

compilade merged 33 commits into master from compilade/bitnet-ternary

compilade commented 327 days ago (edited 279 days ago)

This adds 1.6875 bpw and 2.0625 bpw quant types for TriLMs and BitNet b1.58 models. For now, these are named TQ1_0 and TQ2_0, respectively.
I had given glimpses of this idea starting from #7931 (comment).

The 1.6875 bpw type mostly relies on the fact that 3^5 == 243 < 256 == 2^8 to pack 5 trits per byte.

(I also made a blog post about ternary packing that explains the core idea in a bit more detail: the values are stored in fixed point so that the most significant digit can be extracted first with multiplications.)
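
To make the trick concrete, here is a minimal, self-contained sketch (the helper names pack5/unpack5 are hypothetical and this is not the actual ggml code): five trits are packed into one byte by scaling the base-3 value by 256/243 (rounding up), and each trit is then recovered most-significant-digit-first by multiplying by 3 and taking the carry out of the top byte.

```c
#include <stdint.h>
#include <stdio.h>

// Pack 5 trits (values in {0, 1, 2}) into one byte using base-3 fixed point.
static uint8_t pack5(const uint8_t t[5]) {
    uint16_t v = 0;
    for (int i = 0; i < 5; ++i) {
        v = v * 3 + t[i];                        // base-3 integer in 0..242
    }
    return (uint8_t)((v * 256 + 242) / 243);     // scale by 256/243, rounding up
}

// Unpack most-significant-digit first: multiply by 3, the carry out of the
// top byte is the next trit, the low byte keeps the remaining digits.
static void unpack5(uint8_t q, uint8_t t[5]) {
    for (int i = 0; i < 5; ++i) {
        const uint16_t m = (uint16_t)q * 3;
        t[i] = (uint8_t)(m >> 8);
        q    = (uint8_t)(m & 0xFF);
    }
}

int main(void) {
    const uint8_t trits[5] = {2, 0, 1, 2, 1};
    uint8_t out[5];
    unpack5(pack5(trits), out);
    for (int i = 0; i < 5; ++i) printf("%d ", out[i]);  // prints: 2 0 1 2 1
    printf("\n");
    return 0;
}
```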

Huge thanks to @Eddie-Wang1120, who motivated this by adding initial BitNet b1.58 support in #7931.

How to try it

TriLM models are the easiest to try because all of them have row sizes divisible by 256.

Important

To quantize the token embeddings and the output tensor to Q4_K and Q6_K, you need to run llama-quantize on the model files produced by convert_hf_to_gguf.py --outtype tq1_0 (and likewise for tq2_0). Otherwise these two tensors are kept as f16 and account for most of the size of the models.

$ python3 convert_hf_to_gguf.py /path/to/TriLM_3.9B_Unpacked/ --outfile /somewhere/TriLM-3.9B-TQ1_0-big.gguf --outtype tq1_0
$ ./build/bin/llama-quantize /somewhere/TriLM-3.9B-TQ1_0-big.gguf /somewhere/TriLM-3.9B-TQ1_0.gguf tq1_0

If you want to try TQ2_0, which is faster (but bigger) than TQ1_0 on compute-bound hardware, you can replace tq1_0 with tq2_0 in the above example, but it's also possible to convert from the TQ1_0 model file.

The two ternary formats hold the same values, so round-trip quantizing between the two should result in the same files.

$ ./build/bin/llama-quantize --allow-requantize /somewhere/TriLM-3.9B-TQ1_0.gguf /somewhere/TriLM-3.9B-TQ2_0.gguf tq2_0

Speed

TQ2_0 is twice as fast as Q4_K on my laptop. It's the fastest quant on compute-bound AVX2-capable computers.

This is a table of the float32-equivalent throughput of the vec_dot_q operation for each of these quant types.

| CPU | F16 | Q8_0 | Q4_K | Q2_K | TQ1_0 | TQ2_0 |
| --- | --- | --- | --- | --- | --- | --- |
| Intel Core m3-8100Y (AVX2) | 30.60 GB/s | 67.03 GB/s | 64.17 GB/s | 81.73 GB/s | 70.31 GB/s | 141.83 GB/s |
| Arm Cortex A72 (NEON) | 3.84 GB/s | 9.51 GB/s | 9.26 GB/s | 9.79 GB/s | 11.81 GB/s | 15.78 GB/s |
| Arm Cortex A53 (NEON) | 4.30 GB/s | 5.87 GB/s | 5.76 GB/s | 5.84 GB/s | 8.97 GB/s | 10.29 GB/s |
| AWS t4g (NEON) | 8.69 GB/s | 22.35 GB/s | 25.34 GB/s | 22.84 GB/s | 33.34 GB/s | 44.80 GB/s |
| AWS t4g (DOTPROD) | 49.17 GB/s | 42.63 GB/s | 45.40 GB/s | 29.84 GB/s | 40.44 GB/s | 65.76 GB/s |

From this, it's easy to see that TQ1_0 is usually slightly faster than Q4_K, and that TQ2_0 is by far the fastest quant on AVX2.

Note

There might be a way to make a type similar to TQ2_0, like some sort of Q2_1, which could be almost as fast but still usable by non-ternary models, though this would probably require something like LQER to help preserve precision.

Raw data (click to expand)

Intel Core m3-8100Y:

$ for t in bf16 f16 q8_0 q4_0 q4_K q2_K tq1_0 tq2_0; do ./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type "$t"; done
bf16
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      4.28
      avg cycles/32 vals   :      4.72
      float32 throughput   :     37.89 GB/s
      quantized throughput :     18.95 GB/s

f16
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      5.52
      avg cycles/32 vals   :      5.93
      float32 throughput   :     30.60 GB/s
      quantized throughput :     15.30 GB/s

q8_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      2.27
      avg cycles/32 vals   :      2.56
      float32 throughput   :     67.03 GB/s
      quantized throughput :     17.81 GB/s

q4_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      3.04
      avg cycles/32 vals   :      3.38
      float32 throughput   :     52.20 GB/s
      quantized throughput :      7.34 GB/s

q4_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      2.22
      avg cycles/32 vals   :      2.61
      float32 throughput   :     64.17 GB/s
      quantized throughput :      9.02 GB/s

q2_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      1.77
      avg cycles/32 vals   :      1.99
      float32 throughput   :     81.73 GB/s
      quantized throughput :      6.70 GB/s

tq1_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      2.12
      avg cycles/32 vals   :      2.33
      float32 throughput   :     70.31 GB/s
      quantized throughput :      3.71 GB/s

tq2_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.85
      avg cycles/32 vals   :      0.97
      float32 throughput   :    141.83 GB/s
      quantized throughput :      9.14 GB/s

Arm Cortex A72 (Raspberry Pi 4):

$ for t in f16 q8_0 q4_K q2_K tq1_0 tq2_0; do ./bin/test-quantize-perf --op vec_dot_q -i 2000000 --type "$t"; done                                                                                        
f16
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      3.84 GB/s
      quantized throughput :      1.92 GB/s

q8_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      9.51 GB/s
      quantized throughput :      2.53 GB/s

q4_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      9.26 GB/s
      quantized throughput :      1.30 GB/s

q2_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      9.79 GB/s
      quantized throughput :      0.80 GB/s

tq1_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     11.81 GB/s
      quantized throughput :      0.62 GB/s

tq2_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     15.78 GB/s
      quantized throughput :      1.02 GB/s

Arm Cortex A53 (Some Android phone from 2017):

$ for t in f16 q8_0 q4_K q2_K tq1_0 tq2_0; do ./bin/test-quantize-perf --op vec_dot_q -i 2000000 --type "$t"; done
f16
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      4.30 GB/s
      quantized throughput :      2.15 GB/s

q8_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      5.87 GB/s
      quantized throughput :      1.56 GB/s

q4_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      5.76 GB/s
      quantized throughput :      0.81 GB/s

q2_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      5.84 GB/s
      quantized throughput :      0.48 GB/s

tq1_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      8.97 GB/s
      quantized throughput :      0.47 GB/s

tq2_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     10.29 GB/s
      quantized throughput :      0.66 GB/s

AWS t4g.small instance (Arm Neoverse N1) using NEON:

$ for t in f16 q8_0 q4_K q2_K tq1_0 tq2_0; do ./bin/test-quantize-perf --op vec_dot_q -i 2000000 --type "$t"; done
f16
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      8.69 GB/s
      quantized throughput :      4.35 GB/s

q8_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     22.35 GB/s
      quantized throughput :      5.94 GB/s

q4_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     25.34 GB/s
      quantized throughput :      3.56 GB/s

q2_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     22.84 GB/s
      quantized throughput :      1.87 GB/s

tq1_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     33.34 GB/s
      quantized throughput :      1.76 GB/s

tq2_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     44.80 GB/s
      quantized throughput :      2.89 GB/s

AWS t4g.small (Arm Neoverse N1) with -march=native:

$ for t in f16 q8_0 q4_K q2_K tq1_0 tq2_0; do ./tests/test-quantize-perf --op vec_dot_q -i 2000000 --type "$t"; done
f16
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     49.17 GB/s
      quantized throughput :     24.59 GB/s

q8_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     42.63 GB/s
      quantized throughput :     11.32 GB/s

q4_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     45.40 GB/s
      quantized throughput :      6.38 GB/s

q2_K
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     29.84 GB/s
      quantized throughput :      2.45 GB/s

tq1_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     40.44 GB/s
      quantized throughput :      2.13 GB/s

tq2_0
  vec_dot_q
    4096 values (0.02 MB)
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :     65.76 GB/s
      quantized throughput :      4.24 GB/s

Size

The token embeddings are kept at Q4_K and the output projection at Q6_K, which means the smaller models might be slightly bigger than 2 bits per weight.

All of the TriLM models should work, because their row sizes are multiples of 256. I did not try them all yet, but those I tried are in the table below.

The BitNet b1.58 models from the 1bitLLM team, however, are not all compatible; only the 700M model has dimensions divisible by 256. The others are not supported (yet), unless they are padded.

| Model | F16 | TQ1_0 | TQ2_0 |
| --- | --- | --- | --- |
| https://huggingface.co/1bitLLM/bitnet_b1_58-large (728.84 M) | 1391.26 MiB | 176.65 MiB | 207.03 MiB |
| https://huggingface.co/SpectraSuite/TriLM_390M_Unpacked | 750.39 MiB | 128.04 MiB | 140.98 MiB |
| https://huggingface.co/SpectraSuite/TriLM_1.5B_Unpacked | 2892.09 MiB | 401.54 MiB | 460.04 MiB |
| https://huggingface.co/SpectraSuite/TriLM_2.4B_Unpacked | 4696.86 MiB | 603.59 MiB | 703.26 MiB |
| https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked | 7616.43 MiB | 948.16 MiB | 1112.70 MiB |

Note

The 1.3B BitNet b1.58 model has an FFN size of 5460, which factors into 2 2 3 5 7 13 and is not convenient for any block-wise types based on powers of 2, so these tensors are kept as F16. My hypothesis is that 5460 was a typo for 5440 (which factors into 2 2 2 2 2 2 5 17), but it was kept for some reason and reproduced by the 1bitLLM team. If anyone training ternary models reads this, PLEASE DON'T USE 5460 FOR THE FFN SIZE! Please use multiples of 256 for your row sizes.

Perplexity

Quality seems good. I don't have a powerful machine, so my tests only include the first 16 chunks of wikitext-2-raw with https://huggingface.co/SpectraSuite/TriLM_390M_Unpacked.

The tests below use Q4_K token embeddings and a Q6_K output tensor for TQ1_0 and TQ2_0, while F16 token embeddings and output tensor are used for TQ1_0_L and TQ2_0_L.

| Quant | chunk | PPL | ln(PPL(Q)/PPL(base)) | KL Divergence | Δp RMS | Same top p |
| --- | --- | --- | --- | --- | --- | --- |
| TQ1_0 | 16 | 23.6336 ± 1.0765 | 0.00463 ± 0.00141 | 0.00187 ± 0.00002 | 0.860 ± 0.020 % | 97.279 ± 0.255 % |
| TQ2_0 | 16 | 23.6336 ± 1.0765 | 0.00463 ± 0.00141 | 0.00187 ± 0.00002 | 0.860 ± 0.020 % | 97.279 ± 0.255 % |
| TQ1_0_L | 16 | 23.5758 ± 1.0746 | 0.00218 ± 0.00112 | 0.00034 ± 0.00001 | 0.405 ± 0.012 % | 98.971 ± 0.158 % |
| TQ2_0_L | 16 | 23.5758 ± 1.0746 | 0.00218 ± 0.00112 | 0.00034 ± 0.00001 | 0.405 ± 0.012 % | 98.971 ± 0.158 % |

From this it seems like there is no significant quality loss from the ternary quants for TriLM models (I think the difference from pure f16 comes from the 8-bit activations), and TQ1_0 and TQ2_0 are completely equivalent in quality (as they should be, because lossless conversion between the two is possible).

Structure of TQ1_0

This type relies on the fact that 3^5 == 243 < 256 == 2^8.

In a block of 256 elements, 240 elements are packed at 5 elements per byte, while the last 16 elements are packed at 4 elements per byte.

This means (240 / 5) + (16 / 4) == 48 + 4 == 52 bytes are used to pack 256 ternary weights (this is 1.625 bits per weight).

But there is also one float16 scale per block, so the size of a block is 54 bytes, making it a 1.6875 bpw type. Even though it's not ideal, this is still (log(3) / log(2)) / 1.6875 ≈ 94% of the best possible ternary packing efficiency.

In the table below I'm describing the order of the elements within the bytes. I'm using ranges to make this shorter, with the notation start..end where the start is inclusive and the end is exclusive. (So 0..3 is {0, 1, 2})

Read this as if the ranges of a row are zipped together. A byte never contains more than 5 ternary values.

The ternary values are stored unsigned, so {-1, 0, 1} is stored as {0, 1, 2}.

| byte | x * 3^-1 | x * 3^-2 | x * 3^-3 | x * 3^-4 | x * 3^-5 |
| --- | --- | --- | --- | --- | --- |
| 0..32 | 0..32 | 32..64 | 64..96 | 96..128 | 128..160 |
| 32..48 | 160..176 | 176..192 | 192..208 | 208..224 | 224..240 |
| 48..52 | 240..244 | 244..248 | 248..252 | 252..256 | N/A |

And then bytes 52 and 53 contain the float16 scale in little-endian.

Values are stored in fixed point to allow extracting the most significant digit first. This is explained in https://compilade.net/blog/ternary-packing.
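
To make the layout above concrete, here is a hedged scalar sketch of a TQ1_0 block and its dequantization, following the table and the fixed-point extraction from the blog post (the struct and field names are illustrative; the actual ggml structs and SIMD kernels may differ):

```c
#include <stdint.h>

typedef struct {
    uint8_t  qs[48];   // bytes 0..48: 240 trits, 5 per byte (fixed-point packed)
    uint8_t  qh[4];    // bytes 48..52: last 16 trits, 4 per byte
    uint16_t d;        // bytes 52..53: float16 scale, little-endian
} block_tq1_0_sketch;  // 54 bytes for 256 weights -> 1.6875 bpw

// Dequantize one block into {-1, 0, 1} (the float16 scale is left out here).
static void dequant_tq1_0_sketch(const block_tq1_0_sketch * b, int8_t out[256]) {
    // bytes 0..32: digit n of byte j is element 32*n + j (zipped ranges)
    for (int j = 0; j < 32; ++j) {
        uint8_t q = b->qs[j];
        for (int n = 0; n < 5; ++n) {
            const uint16_t m = (uint16_t)q * 3;
            out[32*n + j] = (int8_t)(m >> 8) - 1;  // {0,1,2} -> {-1,0,1}
            q = (uint8_t)m;
        }
    }
    // bytes 32..48: elements 160..240
    for (int j = 0; j < 16; ++j) {
        uint8_t q = b->qs[32 + j];
        for (int n = 0; n < 5; ++n) {
            const uint16_t m = (uint16_t)q * 3;
            out[160 + 16*n + j] = (int8_t)(m >> 8) - 1;
            q = (uint8_t)m;
        }
    }
    // bytes 48..52: elements 240..256, only 4 trits per byte
    for (int j = 0; j < 4; ++j) {
        uint8_t q = b->qh[j];
        for (int n = 0; n < 4; ++n) {
            const uint16_t m = (uint16_t)q * 3;
            out[240 + 4*n + j] = (int8_t)(m >> 8) - 1;
            q = (uint8_t)m;
        }
    }
}
```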

Structure of TQ2_0

This type was originally inspired by the Q2_2 type made by @Eddie-Wang1120, but the block size, the order, and the mapping of the values are different.

TQ2_0 started as an experiment to see how fast a 2-bit type can be compared to a 1.6-bit type on compute-bound hardware.

This packs each ternary value in 2 bits, which means each byte contains 4 values.

The ternary values are stored unsigned, so {-1, 0, 1} is stored as {0, 1, 2}.

Again, the ranges use the start..end notation where the start is inclusive and the end is exclusive, and the ranges of a row should be read as being zipped together (they advance in parallel in lockstep).

| byte | x << 6 | x << 4 | x << 2 | x << 0 |
| --- | --- | --- | --- | --- |
| 0..32 | 96..128 | 64..96 | 32..64 | 0..32 |
| 32..64 | 224..256 | 192..224 | 160..192 | 128..160 |

And then bytes 64 and 65 contain the float16 scale in little-endian.
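
Likewise, a hedged scalar sketch of a TQ2_0 block following the table above (field names and loop structure are assumptions, not the actual ggml implementation):

```c
#include <stdint.h>

typedef struct {
    uint8_t  qs[64];   // 256 trits, 4 per byte, 2 bits each
    uint16_t d;        // bytes 64..65: float16 scale, little-endian
} block_tq2_0_sketch;  // 66 bytes for 256 weights -> 2.0625 bpw

// Dequantize one block into {-1, 0, 1} (the float16 scale is left out here).
static void dequant_tq2_0_sketch(const block_tq2_0_sketch * b, int8_t out[256]) {
    // bytes 0..32 hold elements 0..128, bytes 32..64 hold elements 128..256;
    // within each half, the 2-bit fields at shifts 0, 2, 4, 6 advance in steps of 32.
    for (int half = 0; half < 2; ++half) {
        for (int j = 0; j < 32; ++j) {
            const uint8_t q = b->qs[32*half + j];
            for (int n = 0; n < 4; ++n) {
                out[128*half + 32*n + j] = (int8_t)((q >> (2*n)) & 3) - 1;  // {0,1,2} -> {-1,0,1}
            }
        }
    }
}
```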

TODO

  • Implement Numpy (de)quantization for TQ1_0 and TQ2_0
  • Allow convert_hf_to_gguf.py to directly convert a ternary model to a ternary encoding
    • Using f16 for the token embeddings and output tensor because Q4_K and Q6_K quantization is not yet supported by gguf-py. This means llama-quantize needs to be used to quantize these tensors.
    • Make it more obvious that the models should go through llama-quantize afterwards.
      • Maybe use other type names, like TQ1_0_L or something?
  • Decide whether the float16 scale should be before or after the packed weights
    • I'd prefer it after because I feel like the scales are read after the weights in dot products, but the convention with the other types (except for Q2_K, Q3_K and Q6_K) is to keep the scale before.
    • Okay, I've decided the scales should stay at the end.
  • More graceful fallback conversion with llama-quantize
    • Using Q4_0 as a fallback type, because the smallest symmetric quant type is Q8_0 but it's a bit big, so Q4_0 it is (even though it's not ideal). Only relevant when row sizes are not multiples of 256.
  • Unify the __ARM_FEATURE_DOTPROD variants of the dot products of TQ1_0 and TQ2_0 with their bare __ARM_NEON variants to reduce code duplication.
  • Test TQ1_0 and TQ2_0 for correctness on an ARM CPU which supports dot product instructions
    • Tested on an AWS t4g.small instance.
    • Also test relative performance for fun
  • Should TQ1_0's first 48 bytes be divided into 3 sub-blocks of 16 bytes (80 elements) instead of one of 32 bytes (160 elements) and one of 16 bytes?
    • I've done the 32-16 split to use 256-bit registers on AVX2 for the pow3 shifts for at least the 32 byte part, but 16-16-16 would be more regular, although it would require using 128-bit registers for all the ternary shifts. Not sure if there's a performance difference.
  • Rename references to "BitNet 1.58b" to "BitNet b1.58". The "b" comes before in the paper.
  • Find a naming convention for BitNet quants and rename Q1_3 and Q2_2
    • They were renamed and redesigned as TQ1_0 and TQ2_0.
  • Decide to keep or to remove the optimization for ggml_mul when the broadcasted tensor only has a single element
  • Fix Android CI build issues.
    • It was apparently a problem with Arm 32-bit. Fixed in 8fbd593

compilade ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
bd807499
compilade ggml-quants : faster 1.625 bpw AVX2 vec_dot
7ef4254a
compilade ggml-quants : substract 1 when back in epi8
48b73b84
compilade ggml-quants : Q2_2 now faster than Q4_K on with AVX2
ef1e345c
compilade ggml-quants : cleanup Q1_3 code formatting
638ad52f
compilade ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
9465ec6e
compilade ggml-quants : use ceiling division when quantizing q1_3
89dc3b25
compilade convert-hf : simplify BitNet pre-quantization
961e2938
compilade convert-hf : allow converting the weird BitNet 1.3B
09961499
compilade added labels: enhancement, python, Review Complexity : High, ggml, Tensor Encoding Scheme
compilade force-pushed from 4522ed78 to 09961499 327 days ago
github-actions added labels: testing, examples
Eddie-Wang1120 commented 327 days ago (edited 327 days ago)

Wonderful job! I'm wondering whether this PR can be merged into the master branch; it would be great if llama.cpp users could use Q2_2 and Q1_3 conveniently.

compilade changed the title from "ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b" to "ggml-quants : 1.625 bpw ternary packing for BitNet b1.58" 327 days ago
compilade bitnet : replace 1.58b with b1.58, as in the paper
bfd2f21f
compilade ggml-quants : fix build failure on Windows
ec50944b
compilade commented on 2024-06-29

examples/quantize/quantize.cpp

    { "IQ2_M",   LLAMA_FTYPE_MOSTLY_IQ2_M,   " 2.7 bpw quantization",      },
    { "IQ1_S",   LLAMA_FTYPE_MOSTLY_IQ1_S,   " 1.56 bpw quantization",     },
    { "IQ1_M",   LLAMA_FTYPE_MOSTLY_IQ1_M,   " 1.75 bpw quantization",     },
    { "Q1_3",    LLAMA_FTYPE_MOSTLY_Q1_3,    " 1.63 bpw for BitNet b1.58", },
    { "Q2_2",    LLAMA_FTYPE_MOSTLY_Q2_2,    " 2.00 bpw for BitNet b1.58", },
compilade commented 326 days ago (edited 326 days ago)

Regarding the names of the new quant types, since these are quite specific to BitNet models, I was thinking of naming them something starting with QB, a bit like suggested in #5761 (comment).

I'll first be describing what I want from the naming scheme, then I'll attempt to make it work.

The naming scheme should have room for:

  • Ternary types in {-1, 0, 1}
    • 1.625 bpw quant with a block size of 64, with 13 bytes per block
      • To make the smallest possible lossless BitNet b1.58 model files
      • Uses Q8_0 as its vec_dot_type (for the activations)
      • (It's technically possible to store a float16 scale in the leftover bits in the last byte of 16 consecutive blocks (this means 1024 elements minimum per row), although it can't really be extracted with SIMD)
    • 2.000 bpw quant with a block size of 32, with 8 bytes per block
      • For maximal performance
      • Uses Q8_0 as its vec_dot_type (for the activations)
    • 2.000 bpw quant with a block size of 64, with 16 bytes per block, and a float16 scale
      • Values would be packed similarly to the 1.625 bpw type, but with an extra byte and a row-wise float16 scale duplicated in each block.
    • 2.000 bpw quant with a block size of 4, with 1 byte per block
      • For weirdly-shaped models like the 1.3B BitNet b1.58 model
      • Needs a compatible vec_dot_type
        • float types are slower than integer types for this
  • Binary types in {-1, 1}
    • 1 bpw type
  • Binary types in {0, 1}
    • Are there models which use this?
  • 8-bit activation with a row-wise scale
    • 8.5 bpw like Q8_0, but all the scales of a row are the same
      • Would allow reducing the number of float32 operations in the vec_dot of the above types.
    • 10 bpw, 5 bytes per block of 4 elements, with a weird layout which only uses blocks to get a big enough buffer, with a single float32 scale and some padding before all row elements, aligned and contiguous.
      • For use with the weird 2.000 bpw type, and also maybe the other ones for best performance.

So the naming scheme could be:

  • QB<x>_<y>
    • where <x> is the floor of the expected bpw of the type
    • where <y> is
      • 0 binary type, {0, 1}
        • except for QB8_0 which is like Q8_0 but with a guaranteed duplicated row-wise scale
      • 1 binary type, {-1, 1}
      • 2 ternary type using some kind of binary-coded ternary
      • 3 ternary type with fixed-point packed values
      • 4 weird type with a block size of 4

Which for the previously-mentioned possible BitNet types would mean:

| proposed name | Range | bits per weight | block size | bytes | row-wise scale | current name |
| --- | --- | --- | --- | --- | --- | --- |
| QB1_3 | {-1, 0, 1} | 1.625 | 64 | 13 | 1.0f | Q1_3 |
| QB2_2 | {-2, -1, 0, 1} | 2.000 | 32 | 8 | 1.0f | Q2_2 |
| QB2_3 | {-1, 0, 1} | 2.000 | 64 | 16 | f16 | |
| QB2_4 | {-2, -1, 0, 1} | 2.000 | 4 | 1 | 1.0f | |
| QB1_1 | {-1, 1} | 1.000 | ? | ?/8 | 1.0f | |
| QB1_0 | {0, 1} | 1.000 | ? | ?/8 | 1.0f | |
| QB8_0 | [-127, 127] | 8.5 | 32 | 34 | f16 | |
| QB8_4 | [-127, 127] | 10 | 4 | 5 | f32, weird layout | |

I'm not saying these should all exist, though, only that the naming scheme should not be too limiting for possible future extensions (which might not exist anyway due to lack of time).

So I think I'll rename Q1_3 to QB1_3, and Q2_2 to QB2_2. Does anyone have comments on this, or a better naming scheme for the new BitNet quant types?

candre23 commented 325 days ago

If it were me, considering this only works with bitnet models and nothing else, I'd want the designations to be exceptionally clear that they are different and shouldn't be used on just anything. "QB" is good, but I'd take it a step further and remove the Q entirely. As bitnet is being colloquially referred to as a "1-bit" model, B1 makes more sense. Considering the plausible range for weights, I'd cut it off at tenths and ditch the decimal. This leaves plenty of room for variations, while making the native BPW very clear. I feel this is superior to the arbitrary "_2" and "_3" subtypes.

So what I would propose is:

1.625bpw = B1_16
2.000bpw = B1_20

compilade ggml-quants : attempt to fix Arm 32-bit support
8fbd5930
Green-Sky commented 325 days ago (edited 325 days ago)

@compilade and @Eddie-Wang1120 continuing the race to the bottom 🥳 , glorious.

Did some quick testing with the 3B model and it looks very good.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| bitnet 3B Q1_3 - 1.625 bpw for BitNet b1.58 | 729.64 MiB | 3.32 B | BLAS | 12 | pp512 | 78.40 ± 0.27 |
| bitnet 3B Q1_3 - 1.625 bpw for BitNet b1.58 | 729.64 MiB | 3.32 B | BLAS | 12 | tg128 | 38.16 ± 0.04 |
| bitnet 3B Q2_2 - 2.000 bpw for BitNet b1.58 | 873.65 MiB | 3.32 B | BLAS | 12 | pp512 | 73.35 ± 6.23 |
| bitnet 3B Q2_2 - 2.000 bpw for BitNet b1.58 | 873.65 MiB | 3.32 B | BLAS | 12 | tg128 | 36.86 ± 0.12 |

What surprises me a little, after reading about q2_2 being faster, is that q1_3 seems to be faster with the setup I used here. Will investigate further.

edit: also updated the files at https://huggingface.co/Green-Sky/bitnet_b1_58-3B-GGUF , for anyone else willing to test.

netrunnereve commented 325 days ago

Did a bit of testing myself; it runs and generates well, but unfortunately it's the undertrained models rather than our implementation that's holding back BitNet adoption. For me Q1_3 is slower, but this computer is CPU-bound rather than memory-bound.

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| bitnet 3B Q1_3 - 1.625 bpw for BitNet 1.58b | 729.64 MiB | 3.32 B | CPU | 4 | pp512 | 15.15 ± 0.07 |
| bitnet 3B Q1_3 - 1.625 bpw for BitNet 1.58b | 729.64 MiB | 3.32 B | CPU | 4 | tg128 | 9.87 ± 0.65 |
| bitnet 3B Q2_2 - 2.000 bpw for BitNet 1.58b | 873.65 MiB | 3.32 B | CPU | 4 | pp512 | 19.25 ± 0.44 |
| bitnet 3B Q2_2 - 2.000 bpw for BitNet 1.58b | 873.65 MiB | 3.32 B | CPU | 4 | tg128 | 13.07 ± 0.28 |
| bitnet 3B Q4_0 | 1.79 GiB | 3.32 B | CPU | 4 | pp512 | 18.44 ± 0.40 |
| bitnet 3B Q4_0 | 1.79 GiB | 3.32 B | CPU | 4 | tg128 | 5.87 ± 0.12 |

I wonder if Q2_2 could be made faster if we used a block size of, say, 256 like the K-quants, so that we can handle more than 64 bits of Q2_2 quants in each dot product loop. Aside from that I can't find any further way to improve the AVX implementation, and while it's ironic that we're using a madds instruction when BitNet technically doesn't require multiplication, that looks like the fastest way to dot the activations and ternary weights.

compilade commented 324 days ago

I wonder if Q2_2 could be made faster if we used a block size of say 256 like the K-quants

Can't go with bigger blocks than 64 elements or else the 3B model won't be fully quantizable. (Its FFN size is 8640 (which factors into 2 2 2 2 2 2 3 3 3 5))

Its current block size is 32, which is the same as its vec_dot_type, Q8_0.

What would also help with performance would be to somehow use an 8-bit vec_dot_type having a single float scale per row. Might be interesting to explore later, but ggml does not have row-wise quant types yet, although this could still be done with a block quant.

it's ironic that we're using a madds instruction

Yeah, with AVX2, there are no good widening addition instructions like on ARM NEON, so _mm256_maddubs_epi16 is used for that.

Meanwhile, NEON doesn't have the equivalent of _mm_sign_epi8, so it needs to use multiplications or conditional masks, which are both slower than a dedicated instruction doing zeroing and sign flipping like in SSSE3.
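
For context, here is a small hedged AVX2 sketch of the kind of unsigned-by-signed widening dot product being discussed, using _mm256_maddubs_epi16 on ternary weights stored unsigned as {0, 1, 2} against int8 activations (illustrative only, not the actual ggml kernel; the function name and the subtract-the-activation-sum convention are assumptions):

```c
#include <immintrin.h>
#include <stdint.h>

// Dot product of 32 ternary weights q (stored as {0,1,2}) with 32 int8 activations y.
static int32_t dot32_ternary_avx2(const uint8_t * q, const int8_t * y) {
    const __m256i vq  = _mm256_loadu_si256((const __m256i *) q);
    const __m256i vy  = _mm256_loadu_si256((const __m256i *) y);
    // unsigned (0..2) x signed -> 16-bit sums of adjacent pairs; no saturation risk here
    const __m256i p16 = _mm256_maddubs_epi16(vq, vy);
    // widen to 32-bit lanes
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));
    // horizontal reduction of the 8 int32 lanes
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(p32), _mm256_extracti128_si256(p32, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    // this is sum(q[i]*y[i]); since the real weight is (q[i] - 1), the caller
    // subtracts the (pre-computed) sum of the activations once per block
    return _mm_cvtsi128_si32(s);
}
```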

ggerganov commented on 2024-07-07
ggml/src/ggml-quants.c

    int8_t x2 = (int8_t)x[i*qk + 2*qk/4 + j];
    int8_t x3 = (int8_t)x[i*qk + 3*qk/4 + j];

    const uint8_t xi0 = x0 < 0 ? 1 : x0 == 0 ? 2 : 3;
    const uint8_t xi1 = x1 < 0 ? 1 : x1 == 0 ? 2 : 3;
    const uint8_t xi2 = x2 < 0 ? 1 : x2 == 0 ? 2 : 3;
    const uint8_t xi3 = x3 < 0 ? 1 : x3 == 0 ? 2 : 3;
ggerganov commented 317 days ago (edited 317 days ago)

As proposed, the type utilizes only 3 of the 4 possible values. I was thinking that the Q2_2 type would work the same as Q4_0, but assuming amax == 1.0f:

void quantize_row_q2_2_reference(const float * restrict x, block_q2_2 * restrict y, int64_t k) {
    static const int qk = QK2_2;

    assert(k % qk == 0);

    const int nb = k / qk;

    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f;

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }

        // assume amax = 1.0f
        max /= amax;

        const float d  = max / -2;
        const float id = d ? 1.0f/d : 0.0f;

        for (int j = 0; j < qk/4; ++j) {
            const float x0 = x[i*qk + 0*qk/4 + j]*id;
            const float x1 = x[i*qk + 1*qk/4 + j]*id;
            const float x2 = x[i*qk + 2*qk/4 + j]*id;
            const float x3 = x[i*qk + 3*qk/4 + j]*id;

            const uint8_t xi0 = MIN(3, (int8_t)(x0 + 2.5f));
            const uint8_t xi1 = MIN(3, (int8_t)(x1 + 2.5f));
            const uint8_t xi2 = MIN(3, (int8_t)(x2 + 2.5f));
            const uint8_t xi3 = MIN(3, (int8_t)(x3 + 2.5f));

            y[i].qs[j]  = xi0;
            y[i].qs[j] |= xi1 << 2;
            y[i].qs[j] |= xi2 << 4;
            y[i].qs[j] |= xi3 << 6;
        }
    }
}

(not tested, just pattern matching the existing quantize_row_q4_0_reference())

Edit: just realized the above would not work. We have to assume that max == 1.0f, not amax, so:

const float max = 1.0f;
...
ggerganov commented on 2024-07-07
ggml/src/ggml.c
    GGML_ASSERT( nb0 == sizeof(float));
    GGML_ASSERT(nb00 == sizeof(float));

    if (ggml_nelements(src1) == 1) {
        float scale = ((float *) src1->data)[0];
        for (int64_t ir = ith; ir < nr; ir += nth) {
            if (dst->data != src0->data) {
                // src0 is same shape as dst => same indices
                memcpy((char *)dst->data + ir*nb1, (char *)src0->data + ir*nb01, ne0 * sizeof(float));
            }
            ggml_vec_scale_f32(ne0, (float *) ((char *) dst->data + ir*nb1), scale);
        }
ggerganov commented 317 days ago

It's ok to keep this optimization

compilade ggml : add some informative comments in q1_3 vec_dot
dd3e62a7
compilade Merge branch 'master' into compilade/bitnet-ternary
79a278e9
compilade commented 295 days ago

Whew, it has been a month since I last touched this, I got distracted for a bit.

(tl;dr at the end)

Now that new ternary models like TriLMs exist (https://arxiv.org/abs/2407.12327), which use multiple scales per tensor and which (fortunately) have all tensor dimensions divisible by 256 🎉, I think I should add a ternary type with 256 elements per block and a block-wise f16 scale. That would result in 1.6875 bpw, which sounds very reasonable to me.

Another ternary type with a scale but with a smaller block size (64) might be useful for compatibility with the BitNet b1.58 models from the 1bitLLM team (because their model dimensions are not divisible by 256), and would be 1.875 bpw or 2.0 bpw depending on whether padding 15 bytes of data to 16 bytes is better for performance.

These should have a similar inference speed as Q1_3, since they will use a similar packing scheme.

I'm not sure if it's worth it to keep the scale-less ternary quant types; I feel like they require too much special handling in the model graphs and in the convert script. It might be okay for BitNetForCausalLM, but not for some newer models like TriLMs which use LlamaForCausalLM, AKA not a ternary-specific architecture.

So I'll be proposing 4 (starting with 2) types, with yet another attempt at a naming scheme^1 for ternary quants, this time matching the regex TQ\d(_\dF?)?:

  • TQ1_0
    • ternary quant with 256 elements per block at 1.6875 bpw.
    • the packing would be similar to Q1_3, but repeated 4 times, and with a f16 scale.
    • its vec_dot_type could be Q8_K
  • TQ1_0F
  • (maybe) TQ2_0
    • ternary quant with 256 elements per block at 2.0625 bpw.
    • similar packing as Q2_2, so it should be performant, unless on platforms where the misalignment from the 2 bytes of the scale has some effect.
    • its vec_dot_type could be Q8_K
    • much simpler than IQ2_XXS, which can't even represent 0 unless the whole block is 0.
  • (maybe) TQ2_0F
    • same as TQ1_0F, but based on Q2_2 instead of Q1_3.
    • 2.25 bpw

Note that IQ2_XXS is already a 256-element type with similar properties as TQ2_0, although IQ2_XXS's packing scheme is much more complicated and I feel like its reliance on iq2xxs_grid makes it unnecessarily slower than it could be.^2

I'll work on at least TQ1_0 and TQ1_0F in the next days, but I might get distracted. I'm doing this as a hobby in my free time, so it's possible that my priorities shift depending on external factors. This means anyone interested should feel free to ping me if I seem to have forgotten this again.

TL;DR: I think I'll replace the scale-less Q1_3 and Q2_2 with ternary types with a block-wise scale, which should allow supporting both BitNet b1.58 and TriLMs, while also simplifying the conversion for BitNet b1.58 because separate scale tensors won't be needed anymore.

Footnotes

  1. Some rationale for the naming scheme: using a special prefix to note that these are special-purpose, TQ stands for "ternary quant", not using QT to avoid confusion with https://www.qt.io/, and also because the IQ quants also prefix Q with a letter. I'm using _0 as suffix to mean that it has a scale similarly to Q8_0.

  2. Okay, I've read a bit about IQ2_XXS, and it seems slightly over-engineered and totally not intended for ternary models. Basically, it strongly relies on a lookup table (iq2xxs_grid), which contains 3 possible values in each byte: 0x08, 0x19 or 0x2b (8, 25, 43, respectively). This looks like where the absolute values of the elements come from (before being scaled and signed). This means 0 is not representable unless the whole block is 0.

Green-Sky commented 295 days ago

@compilade Keep up the good work. You are a hero making living on the edge affordable 😄.
Besides the others here, of course... 😉

Not sure if anyone has noticed, but Meta (Facebook) changed the license for Llama 3.1 to allow training on outputs, which would allow for distillation.
So now I am waiting for a BitNet distillation of the new 3.1 Llamas to pop up (hopefully).

mofosyne commented 295 days ago

@compilade btw, a quick question regarding the packing structure of these encodings: is there a consistent way to extract the bit-pattern structure from the source code? It's a bit hard to grok the superblocks, the blocks, and how the bits are being packed, for documentation purposes. Ideally I would like such documentation to be autogenerated, but until I can understand the basics from the C structs... it's a bit hard to get started.

ggerganov commented 295 days ago

The plan sounds good. I wouldn't worry about the fallback types - we already have a workaround via padding for such kind of models, plus I doubt there will be much of those in the future.

compilade commented 295 days ago

we already have a workaround via padding for such kind of models

@ggerganov While it mostly works, padding like in e9f2abf isn't correct with ggml_rms_norm, because the row size is used to calculate the mean.

https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml.c#L11813

To make padding work properly, there would need to be some special handling to make it possible to use ne[0] values which are not multiples of the block size (like making ggml_row_size round up).

The GGUF file format should already support that, since the tensor offsets don't directly depend on their size.

But GGUFWriter would need to avoid assuming a lossless round-trip between shape and byte shape.

Quantization and dequantization would need to be adapted, because the functions currently assume ne[0] is a multiple of the block size. The quantize_row_*_ref functions don't necessarily know ne[0] directly (they get the total element count in a chunk of rows), but that should be easy enough to adapt by doing one call per row when padding is needed, a bit like how importance matrices are applied one row at a time. Or padding could be handled outside, but this would (momentarily) use more memory for the padded f32 copies (unpadding can be done with views).

Dot products would need no change if the padding values are equivalent to zero (this won't work for IQ2_XXS and likely other IQ types which can't represent zero).

I wouldn't worry about the fallback types

Understood. I agree with adding fewer types. And using padding could even let the cursed https://huggingface.co/1bitLLM/bitnet_b1_58-xl be quantized with its weird FFN size of 5460 which factors into 2 2 3 5 7 13.

I'll start with not handling padding, because it would affect other types too (notably Q8_K), and might be more appropriate in a separate PR.
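
As a rough illustration of the "padding handled outside" option mentioned above, here is a minimal sketch assuming a generic per-row quantization callback (quantize_row_fn and quantize_row_padded are hypothetical names, not ggml APIs):

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

// Hypothetical signature for a row quantization function.
typedef void (*quantize_row_fn)(const float * src, void * dst, int64_t n);

// Pad one f32 row with zeros up to a multiple of the block size qk before
// quantizing, so the dot product sees padding values equivalent to zero.
static void quantize_row_padded(const float * row, int64_t ne0, int64_t qk,
                                void * dst, quantize_row_fn quantize_row) {
    const int64_t padded = ((ne0 + qk - 1) / qk) * qk;  // round up to a multiple of qk
    if (padded == ne0) {
        quantize_row(row, dst, ne0);
        return;
    }
    float * tmp = calloc(padded, sizeof(float));        // zero-initialized padding
    if (!tmp) { return; }
    memcpy(tmp, row, ne0 * sizeof(float));
    quantize_row(tmp, dst, padded);
    free(tmp);
}
```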

compilade commented 295 days ago

Is there a consistent way to extract the bit pattern structure from the source code? It's a bit hard to grok the superblock, blocks and how bits are being packed for documentation. Ideally too I would like such documentation to be autogenerated as well, but until I can understand the basics from the C struct... it's a bit hard to get started.

@mofosyne

No, unfortunately, I don't think this can be easily automated. Sometimes a single field in the structs stores multiple types of values, like in Q4_K where block_q4_K.scales stores 6-bit scales and mins in some pattern^1. The easiest way to understand what the bits mean is to have a look at the respective dequantize_row function of each type.

Footnotes

  1. The 12 bytes in Q4_K .scales are packed a bit like this, where the uppercased letters are bits for the scales and lowercased letters are the bits of the mins:

     0: EEAAAAAA
     1: FFBBBBBB
     2: GGCCCCCC
     3: HHDDDDDD
     4: eeaaaaaa
     5: ffbbbbbb
     6: ggcccccc
     7: hhdddddd
     8: eeeeEEEE
     9: ffffFFFF
    10: ggggGGGG
    11: hhhhHHHH
    

    Source: https://github.com/ggerganov/llama.cpp/blob/75af08c475e285888f66556d0f459c533b7deb95/ggml/src/ggml-quants.c#L1891-L1898
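
For reference, here is a small sketch that decodes the byte diagram above (written from the pattern shown, not copied from ggml-quants.c; the function name is illustrative):

```c
#include <stdint.h>

// Extract the j-th 6-bit scale (uppercase bits) and min (lowercase bits)
// from the 12 packed bytes of a Q4_K block, for j in 0..7.
static void get_scale_min_j(const uint8_t * s, int j, uint8_t * scale, uint8_t * min) {
    if (j < 4) {
        *scale = s[j]     & 63;  // AAAAAA .. DDDDDD
        *min   = s[j + 4] & 63;  // aaaaaa .. dddddd
    } else {
        // low 4 bits come from bytes 8..11, high 2 bits from the top of bytes 0..7
        *scale = (uint8_t)((s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4));  // EEEE | EE
        *min   = (uint8_t)((s[j + 4] >>   4) | ((s[j    ] >> 6) << 4));  // eeee | ee
    }
}
```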

mofosyne commented 295 days ago (edited 294 days ago)

@compilade thanks for the explanation. It's interesting to see that the bits are split into 2-bit and 4-bit parts and that only bitwise operations are used. Is this preferred over packing each 6-bit scale in sequential order because each access is aligned, or because bitwise operations are cheaper?


edit: Ah, likely to be friendlier for parallel processing on GPUs, etc...

ggerganov commented 294 days ago

@ggerganov While it mostly works, padding like in e9f2abf isn't correct with ggml_rms_norm, because the row size is used to calculate the mean

Correct, the norm can be applied on a view having the original size though (1D tensors used for normalisations are never quantised).

compilade ggml : add TQ1_0 and TQ2_0 ternary quantization types
77b8f84a
compilade commented 294 days ago (edited 293 days ago)

I've made some preliminary performance (speed) tests with TQ1_0 and TQ2_0. TQ1_0 is faster than Q1_3, now around the speed of Q8_0, while TQ2_0 got a very big perf boost and is twice as fast as TQ1_0, which makes it by far the fastest quant type (around 2x faster than Q8_0, and 1.7x faster than Q2_K)^1, at least with AVX2 on my machine. Bigger block sizes do pay off!

(And Q8_K is a very good vec_dot_type, with a f32 scale and even pre-computed sums)

Note that this is about the vec_dot speed and not the overall speed, although it's usually where most of the compute time is spent.

The formats of TQ1_0 and TQ2_0 are a bit different than what I initially planned, to make the data more convenient to access in the AVX2 vec_dot. Something nice is that unlike Q1_3, TQ1_0 does not rely on reading past the buffer (Q1_3 has 13 byte blocks which were read in 16 byte chunks).

A possible future improvement for the AVX2 vec_dot of TQ1_0 would be to test if 16-bit multiplies and permutes are faster or not than more elaborate ways to shift 8-bit values by powers of 3 (AVX2 does not have non-widening 8-bit multiplies), but both approaches were mostly similar in performance on my machine, so I went with the 8-bit operations.

I'll port TQ1_0 and TQ2_0 to ARM NEON in the next days, and I'll remove Q1_3 and Q2_2 after making comparisons on low-end ARM devices.


Is this because it's preferred over packing each 6bit scale in sequential order, because each access is aligned or is cheaper to use bitwise operations?

@mofosyne I had no part in the decision of the scale packing in Q4_K, but I think it's like this because indexing is only done at the byte level, so packing and unpacking 6-bit values has to use bitwise operations. Pointers can only jump at a minimum of a byte at a time. Also when making the vec_dot of Q1_3 I've noticed that shuffles are surprisingly as fast as additions in SIMD.

Footnotes

  1. Proof:

    Output of test-quantize-perf (click to expand)
    $ for t in q4_0 q8_0 q4_K q2_K tq2_0 tq1_0 q1_3 q2_2; do ./bin/test-quantize-perf --op vec_dot_q --type $t -i 10000000; done
    q4_0
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      3.03
          avg cycles/32 vals   :      3.33
          float32 throughput   :     52.88 GB/s
          quantized throughput :      7.44 GB/s
    
    q8_0
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      2.24
          avg cycles/32 vals   :      2.51
          float32 throughput   :     68.26 GB/s
          quantized throughput :     18.13 GB/s
    
    q4_K
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      2.22
          avg cycles/32 vals   :      2.68
          float32 throughput   :     62.68 GB/s
          quantized throughput :      8.81 GB/s
    
    q2_K
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      1.75
          avg cycles/32 vals   :      1.99
          float32 throughput   :     81.82 GB/s
          quantized throughput :      6.71 GB/s
    
    tq2_0
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      0.83
          avg cycles/32 vals   :      0.95
          float32 throughput   :    144.50 GB/s
          quantized throughput :      9.31 GB/s
    
    tq1_0
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      2.11
          avg cycles/32 vals   :      2.29
          float32 throughput   :     71.35 GB/s
          quantized throughput :      3.76 GB/s
    
    q1_3
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      2.94
          avg cycles/32 vals   :      3.46
          float32 throughput   :     50.02 GB/s
          quantized throughput :      2.54 GB/s
    
    q2_2
      vec_dot_q
        4096 values (0.02 MB)
          min cycles/32 vals   :      2.12
          avg cycles/32 vals   :      2.33
          float32 throughput   :     73.31 GB/s
          quantized throughput :      4.58 GB/s
    
compilade ggml : even faster TQ2_0
560873f3
compilade ggml : also faster TQ1_0
e9719576
flatsiedatsie commented 293 days ago

(some discussion around compilade's improvement can be found on Reddit)

compilade ggml : fix build issues in certain environments
a6dd6994
compilade ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
5417089a
compilade ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
45719a24
compilade commented 292 days ago (edited 292 days ago)

I've tested that a round-trip quantization between TQ1_0 and TQ2_0 is lossless, which means one can always be made from the other.

$ ./build/bin/llama-quantize models/trilm-390M-f16.gguf models/trilm-390M-tq1_0.gguf tq1_0
$ ./build/bin/llama-quantize models/trilm-390M-f16.gguf models/trilm-390M-tq2_0.gguf tq2_0
$ ./build/bin/llama-quantize --allow-requantize models/trilm-390M-tq1_0.gguf models/trilm-390M-tq2_0-requant.gguf tq2_0
$ ./build/bin/llama-quantize --allow-requantize models/trilm-390M-tq2_0-requant.gguf models/trilm-390M-tq1_0-roundtrip.gguf tq1_0
$ cd models
$ sha256sum trilm-390M-tq*
e4c622fb10dcfa30d427eb94eb08ffdcbde8ef3683a2b43a1b1eac8ab6e3e67f  trilm-390M-tq1_0.gguf
e4c622fb10dcfa30d427eb94eb08ffdcbde8ef3683a2b43a1b1eac8ab6e3e67f  trilm-390M-tq1_0-roundtrip.gguf
4edaaa33f8d7ffeaac72d758bf0e253512128a4a872a9c428bf337abb21a64be  trilm-390M-tq2_0.gguf
4edaaa33f8d7ffeaac72d758bf0e253512128a4a872a9c428bf337abb21a64be  trilm-390M-tq2_0-requant.gguf

I've also added ARM NEON implementations of vec_dot for TQ1_0 and TQ2_0, but the relative speedup on a Raspberry Pi 4 B is less impressive than with AVX2 on my laptop. There might still be ways to optimize the use of ARM NEON in there.

Still, it's decent, at 1.6x the speed of Q8_0 for TQ2_0. But the RPi4 is very memory bound (with a bandwidth only around 3GB/s), so actual inference speed is relatively much better with smaller types.

But I'm happy that TQ1_0 is 1.75x as fast as Q1_3 on that machine. The gap between TQ1_0 and TQ2_0 is also smaller than with AVX2.

Output of test-quantize-perf on a RPi4 (click to expand)
$ for t in q4_0 q8_0 q4_K q2_K tq2_0 tq1_0 q1_3 q2_2; do ./bin/test-quantize-perf --op vec_dot_q --type $t -i 2000000; done                                                                                                         
q4_0                                                                                                                                                                                                                                                                       
  vec_dot_q                                                                                                                          
    4096 values (0.02 MB)                                                                                                            
      min cycles/32 vals   :      0.00                                                                                               
      avg cycles/32 vals   :      0.00                                                                                               
      float32 throughput   :      7.82 GB/s                                                                                          
      quantized throughput :      1.10 GB/s                                                                                          
                                                                                                                                     
q8_0                                                                                                                                 
  vec_dot_q                                                                                                                          
    4096 values (0.02 MB)                                                                                                            
      min cycles/32 vals   :      0.00                                                                                               
      avg cycles/32 vals   :      0.00                                                                                               
      float32 throughput   :      9.57 GB/s                                                                                          
      quantized throughput :      2.54 GB/s                                                                                          
                                                                                                                                     
q4_K                                                                                                                                 
  vec_dot_q                                                                                                                          
    4096 values (0.02 MB)                                                                                                            
      min cycles/32 vals   :      0.00                                                                                               
      avg cycles/32 vals   :      0.00                                                                                               
      float32 throughput   :      9.38 GB/s                                                                                          
      quantized throughput :      1.32 GB/s                                                                                          
                                                                                                                                     
q2_K                                                                                                                                 
  vec_dot_q                                                                                                                          
    4096 values (0.02 MB)                                                                                                            
      min cycles/32 vals   :      0.00                                                                                               
      avg cycles/32 vals   :      0.00                                                                                               
      float32 throughput   :      9.64 GB/s
      quantized throughput :      0.79 GB/s
                                                                  
tq2_0                                
  vec_dot_q                              
    4096 values (0.02 MB)                
      min cycles/32 vals   :      0.00                                             
      avg cycles/32 vals   :      0.00                                             
      float32 throughput   :     15.35 GB/s                                        
      quantized throughput :      0.99 GB/s                                        
                                                                  
tq1_0                                          
  vec_dot_q                                    
    4096 values (0.02 MB)                      
      min cycles/32 vals   :      0.00                                                         
      avg cycles/32 vals   :      0.00                                                         
      float32 throughput   :     11.82 GB/s                                                    
      quantized throughput :      0.62 GB/s
                                                                  
q1_3                                                              
  vec_dot_q                                                       
    4096 values (0.02 MB)                                         
      min cycles/32 vals   :      0.00                            
      avg cycles/32 vals   :      0.00
      float32 throughput   :      6.75 GB/s
      quantized throughput :      0.34 GB/s
                                                                  
q2_2                                                              
  vec_dot_q                                                       
    4096 values (0.02 MB)                                                                                                            
      min cycles/32 vals   :      0.00
      avg cycles/32 vals   :      0.00
      float32 throughput   :      9.14 GB/s
      quantized throughput :      0.57 GB/s

The next steps are to remove Q1_3 and Q2_2, and to adapt the convert script to let it convert directly to at least one of TQ1_0 or TQ2_0.

compilade marked this pull request as draft 292 days ago
compilade ggml : remove q1_3 and q2_2
04eec581
compilade changed the title from "ggml-quants : 1.625 bpw ternary packing for BitNet b1.58" to "ggml-quants : ternary packing for TriLMs and BitNet b1.58" 290 days ago
Green-Sky commented 290 days ago (edited 290 days ago)

I saw compilade remove the old bitnet quants, so I decided it was time for another round of tests.

Since the large bitnet repro model does not work with the new quants (as explained in the OP), I switched to the TriLM_3.9B model.

| quant | ppl | ppl@300 | filesize |
| --- | --- | --- | --- |
| f16 | 11.1532 +/- 0.07854 | 11.0180 | 7.5G |
| q8_0 | 11.1489 +/- 0.07851 | 11.015 | 4.0G |
| q4_0 | 11.4797 +/- 0.08058 | 11.3249 | 2.2G |
| q4_k | 11.1559 +/- 0.07854 | 11.0223 | 2.3G |
| tq2_0 | 11.1558 +/- 0.07853 | 11.0200 | 1.1G |
| tq1_0 | 11.1558 +/- 0.07853 | 11.0200 | 949M |

I added ppl at step 300 for reference to speed up future ppl calculations.

Note: Offloading tq2_0 layers to VRAM (CUDA) improved the time by ~20%. It was still 10x slower than q4_k, though.

As always I used default settings calculating perplexity over 560 chunks, n_ctx=512, batch_size=2048, n_seq=4

edit: uploaded some quantized files again https://huggingface.co/Green-Sky/TriLM_3.9B-GGUF

compilade ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
f034aa1b
compilade ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
96b3d411
Hugi-R
compilade Merge branch 'master' into compilade/bitnet-ternary
d911cd1f
compilade gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
3a0bf17d
compilade convert : allow direct conversion to TQ1_0 and TQ2_0
895004f3
compilade ggml-quants : allow using ARM dot product instructions for TQ1_0
69f77268
compilade Merge branch 'master' into compilade/bitnet-ternary
82b24040
compilade ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
35cc5567
compilade marked this pull request as ready for review 280 days ago
ggerganov approved these changes on 2024-08-15
compilade Merge branch 'master' into compilade/bitnet-ternary
cb6d9962
compilade Merge branch 'master' into compilade/bitnet-ternary
7f3a619c
compilade ggml ; remove unused ggml_mul special case
8d616076
compilade test-backend-ops : add TQ1_0 and TQ2_0 comments for later
75b3a096
compilade force-pushed from e4dc48a5 to 75b3a096 258 days ago
compilade added the merge ready label
compilade merged 9bc6db28 into master 257 days ago
