📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 527 iterations 🚀

[Chart: llamacpp:prompt_tokens_seconds, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]

[Chart: llamacpp:predicted_tokens_seconds, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]

[Chart: llamacpp:kv_cache_usage_ratio, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]

[Chart: llamacpp:requests_processing, bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations]
Could you demonstrate that short perplexity runs produce reasonable values compared to no-SVE?
Thanks for the comment! I ran perplexity with SVE and no-SVE. The following are the commands and partial logs.
### Q8_0 / no-SVE
```sh
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
```

```
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 906.69 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.47 seconds per pass - ETA 0.15 minutes
[1]5.2130,[2]7.4447,[3]7.4725,[4]8.4178,
Final estimate: PPL = 8.4178 +/- 1.61226
llama_print_timings: load time = 314.22 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 9876.98 ms / 512 tokens ( 19.29 ms per token, 51.84 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 10796.42 ms / 513 tokens
```
### Q8_0 / SVE
```sh
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
```

```
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 915.193 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.99 seconds per pass - ETA 0.05 minutes
[1]5.2291,[2]7.4493,[3]7.4706,[4]8.4219,
Final estimate: PPL = 8.4219 +/- 1.61261
llama_print_timings: load time = 304.68 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3940.02 ms / 512 tokens ( 7.70 ms per token, 129.95 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4868.40 ms / 513 tokens
```
### Q4_0 / no-SVE
```sh
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
```

```
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 898.157 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.53 seconds per pass - ETA 0.17 minutes
[1]5.4426,[2]7.4845,[3]7.9395,[4]9.0525,
Final estimate: PPL = 9.0525 +/- 1.80378
llama_print_timings: load time = 13751.66 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 10110.36 ms / 512 tokens ( 19.75 ms per token, 50.64 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 11021.03 ms / 513 tokens
```
### Q4_0 / SVE
```sh
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
```

```
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 901.443 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.09 seconds per pass - ETA 0.07 minutes
[1]5.4306,[2]7.4762,[3]7.9293,[4]9.0456,
Final estimate: PPL = 9.0456 +/- 1.80407
llama_print_timings: load time = 184.21 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4340.33 ms / 512 tokens ( 8.48 ms per token, 117.96 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 5254.53 ms / 513 tokens
```
And below is a summary:

| SIMD | Type | PPL | Total time [ms] |
|---|---|---|---|
| NEON | Q8_0 | 8.4178 +/- 1.61226 | 10796.42 |
| SVE | Q8_0 | 8.4219 +/- 1.61261 | 4868.40 |
| NEON | Q4_0 | 9.0525 +/- 1.80378 | 11021.03 |
| SVE | Q4_0 | 9.0456 +/- 1.80407 | 5254.53 |
This change does not appear to have any impact on accuracy.
Thanks. I checked Azure Cloud to see if I can rent a node that supports Arm SVE, and it seems VMs will be available soon: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series?tabs=sizebasic

These VMs are currently in preview; when they become generally available, we can add `ggml-ci` for that instruction set.
I don't understand why, but after this PR I was having build issues on one of my machines when using `make`: the GPU could not be detected to determine the correct CUDA arch for `-arch=native`, even though there was no change to the Makefile. However, this seems to have been related to ccache, since the compilation worked with `LLAMA_NO_CCACHE`; deleting `~/.cache/ccache` has permanently fixed the issue for me.
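In case it helps anyone else, here is a minimal sketch of that workaround, assuming the Makefile build and the default ccache directory:

```sh
# Confirm ccache is the culprit by building once with it disabled ...
LLAMA_NO_CCACHE=1 make
# ... then clear the stale cache so regular builds work again
rm -rf ~/.cache/ccache
```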
@ggerganov That's great. Thank you for sharing the information. If there is anything I can do to help with CI/CD for the SVE implementation, I would like to contribute!
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_0_q8_0 and q8_0_q8_0 vector dot products on the Arm architecture. A similar proposal for SVE support was made in PR #5780, but that one also includes changes to the block layout.
This PR implements the SVE vector dot products with minimal changes, as a first step toward SVE support. The performance gain is smaller than that of PR #5780, but it is roughly 1.1x to 1.5x faster than the original implementation.
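For readers unfamiliar with SVE, below is a minimal sketch of the predicated-loop pattern such kernels follow, written as a plain float dot product with ACLE intrinsics. It is illustrative only, not the quantized kernel from this PR, and it assumes a toolchain with SVE enabled (e.g. `-march=armv8-a+sve`):

```c
#include <arm_sve.h>

// Illustrative sketch (not the PR's quantized kernel): an SVE dot product
// over float arrays. The predicate handles the tail, so no scalar epilogue
// is needed regardless of the hardware vector length.
float sve_dot_f32(const float *a, const float *b, int n) {
    svfloat32_t acc = svdup_f32(0.0f);
    for (int i = 0; i < n; i += (int) svcntw()) {   // svcntw() = f32 lanes per vector
        svbool_t pg = svwhilelt_b32(i, n);          // active lanes for this step
        svfloat32_t va = svld1_f32(pg, a + i);      // masked loads
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);         // acc += va * vb on active lanes
    }
    return svaddv_f32(svptrue_b32(), acc);          // horizontal sum
}
```

The quantized kernels follow the same overall loop structure, but operate on the int8 block data of the q4_0/q8_0 formats instead of plain floats.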
SVE is enabled by setting `LLAMA_SVE=ON` in cmake. Here is an example of the compilation commands:
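A plausible sketch of such a build, assuming an out-of-tree CMake configuration with the `LLAMA_SVE` option named above (the commands are hypothetical, not taken from the PR):

```sh
# Hypothetical build sketch: configure with SVE enabled, then build
cmake -B build-sve -DLLAMA_SVE=ON
cmake --build build-sve --config Release
```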
Here is the performance measured on AWS Graviton3E (hpc7g):

[Chart: Q4_0_Q8_0 decoding throughput [tokens/sec]]

[Chart: Q8_0_Q8_0 decoding throughput [tokens/sec]]
Limitation: this pull request only supports 256-bit SVE.
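Since the kernels assume a 256-bit vector length, a guard along these lines (a hypothetical sketch, not code from this PR) can check the runtime vector length before taking the SVE path:

```c
#include <arm_sve.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical check (not from this PR): the SVE kernels above assume
// 256-bit registers, so verify the runtime vector length first.
int main(void) {
    const uint64_t vl_bits = svcntb() * 8;   // svcntb() = vector length in bytes
    if (vl_bits != 256) {
        printf("SVE vector length is %llu bits; the 256-bit-only kernels do not apply\n",
               (unsigned long long) vl_bits);
        return 1;
    }
    printf("256-bit SVE vectors available\n");
    return 0;
}
```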