llama.cpp
ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot
#7433
Merged


msy-kato 1 year ago (edited)

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_0_q8_0 and q8_0_q8_0 vector dot on the Arm architecture. A similar proposal for SVE support was made in PR #5780, but that one also includes changes to the block layout.

This PR implements the SVE vector dot with minimal changes, as a first step toward SVE support. The performance gain is smaller than that of PR #5780, but it is approximately 1.1x to 1.5x faster than the original NEON implementation.
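
To make the change concrete, here is a minimal sketch of an SVE q8_0_q8_0 vector dot, assuming a 256-bit vector length so that one q8_0 block (32 int8 quants) fills a vector exactly. It follows ggml's block_q8_0 layout but is not the exact code from this PR; for simplicity the scale is stored as `__fp16` rather than ggml's raw `uint16_t` bits.

```c
#include <arm_sve.h>
#include <stdint.h>

// ggml's block_q8_0 layout: an fp16 scale `d` followed by 32 int8 quants.
// ggml itself stores d as uint16_t and converts with GGML_FP16_TO_FP32;
// __fp16 is used here so the sketch is self-contained.
typedef struct { __fp16 d; int8_t qs[32]; } block_q8_0;

static float vec_dot_q8_0_q8_0_sve(int n, const block_q8_0 * x, const block_q8_0 * y) {
    const int nb = n / 32;               // number of q8_0 blocks
    const svbool_t pg8 = svptrue_b8();   // all 8-bit lanes active (32 at VL=256)
    float sumf = 0.0f;

    for (int i = 0; i < nb; i++) {
        const svint8_t qx = svld1_s8(pg8, x[i].qs);
        const svint8_t qy = svld1_s8(pg8, y[i].qs);
        // SDOT: each int32 lane accumulates four adjacent int8*int8 products
        const svint32_t p = svdot_s32(svdup_n_s32(0), qx, qy);
        // horizontal sum of the int32 lanes, then apply both block scales
        sumf += (float) svaddv_s32(svptrue_b32(), p) * (float) x[i].d * (float) y[i].d;
    }
    return sumf;
}
```

The q4_0_q8_0 kernel follows the same svdot pattern, but first has to unpack the packed 4-bit quants (splitting low and high nibbles) and subtract the q4_0 zero-point of 8 before the multiply-accumulate.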

SVE is enabled if LLAMA_SVE=ON is set in CMake. Here is an example of the build commands:

$ cmake -DLLAMA_SVE=ON -B build -S .
$ cmake --build build -j$(($(nproc)/2))
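
For context, a hedged sketch of how such a kernel is typically gated at compile time (the actual diff may differ): the compiler predefines the standard ACLE feature macro `__ARM_FEATURE_SVE` when the target architecture includes SVE, which is what LLAMA_SVE=ON is expected to arrange.

```c
// Typical compile-time gating pattern for an SVE code path in C.
// __ARM_FEATURE_SVE is defined by the compiler when building for an
// SVE-capable target (e.g. -march=armv8.2-a+sve).
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
// ... SVE implementation of the vector dot ...
#elif defined(__ARM_NEON)
// ... existing NEON implementation ...
#endif
```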

Here are the performance results measured on AWS Graviton3E (hpc7g), using the following commands:

### Q4_0_Q8_0
$  ./build/bin/main --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q4_0.gguf-prompt.bin

### Q8_0_Q8_0
$  ./build/bin/main --model models/llama-2-7b-chat.Q8_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q8_0.gguf-prompt.bin

Q4_0_Q8_0

Decoding throughput [tokens/sec]

| Threads | Original (NEON) | This PR (SVE) | Ratio |
|--------:|----------------:|--------------:|------:|
| 2       | 3.16            | 4.05          | 1.28  |
| 4       | 6.21            | 7.88          | 1.27  |
| 8       | 11.92           | 14.81         | 1.24  |
| 16      | 21.54           | 25.77         | 1.20  |
| 32      | 32.38           | 36.21         | 1.12  |

Q8_0_Q8_0

Decoding throughput [tokens/sec]

| Threads | Original (NEON) | This PR (SVE) | Ratio |
|--------:|----------------:|--------------:|------:|
| 2       | 3.14            | 4.60          | 1.46  |
| 4       | 6.10            | 8.97          | 1.47  |
| 8       | 11.46           | 16.29         | 1.42  |
| 16      | 20.20           | 23.77         | 1.18  |
| 32      | 24.72           | 26.01         | 1.05  |

Limitation: this pull request only supports a 256-bit SVE vector length.
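
One way to verify that constraint at runtime is to query the vector length; this is a hedged sketch, and the helper name is hypothetical, not from this PR.

```c
#include <arm_sve.h>
#include <stdbool.h>

// svcntb() returns the SVE vector length in bytes; the kernels in this PR
// assume 32 bytes (256 bits), i.e. one full q8_0 block per vector register.
static inline bool sve_vl_is_256(void) {
    return svcntb() == 32;
}
```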

github-actions added the build and ggml labels
github-actions 1 year ago (edited)

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 527 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8868.41ms p(95)=22845.51ms fails=, finish reason: stop=470 truncated=57
  • Prompt processing (pp): avg=103.39tk/s p(95)=462.02tk/s
  • Token generation (tg): avg=47.79tk/s p(95)=48.22tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=feat-sve-q4_0_q8_0-q8_0_q8_0 commit=d28bfd5ef7492548d6e000b6ad2cb6042161ec95

[Benchmark charts: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for the bench-server-baseline run on Standard_NC4as_T4_v3 (duration=10m, 527 iterations)]

mofosyne added the Review Complexity : High label
msy-kato Add SVE support for q4_0_q8_0 q8_0_q8_0
19531ac4
msy-kato remove ifdef
d28bfd5e
msy-kato force-pushed from d671a171 to d28bfd5e 1 year ago
ggerganov approved these changes on 2024-05-23
ggerganov 1 year ago

Could you demonstrate that short perplexity runs produce reasonable values compared to no-SVE?

msy-kato 1 year ago (edited) 👍 1

Thanks for the comment! I ran perplexity with SVE and no-SVE. The commands and partial logs follow.

### Q8_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 906.69 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.47 seconds per pass - ETA 0.15 minutes
[1]5.2130,[2]7.4447,[3]7.4725,[4]8.4178,
Final estimate: PPL = 8.4178 +/- 1.61226

llama_print_timings:        load time =     314.22 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    9876.98 ms /   512 tokens (   19.29 ms per token,    51.84 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   10796.42 ms /   513 tokens

### Q8_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 915.193 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.99 seconds per pass - ETA 0.05 minutes
[1]5.2291,[2]7.4493,[3]7.4706,[4]8.4219,
Final estimate: PPL = 8.4219 +/- 1.61261

llama_print_timings:        load time =     304.68 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    3940.02 ms /   512 tokens (    7.70 ms per token,   129.95 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    4868.40 ms /   513 tokens

### Q4_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 898.157 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.53 seconds per pass - ETA 0.17 minutes
[1]5.4426,[2]7.4845,[3]7.9395,[4]9.0525,
Final estimate: PPL = 9.0525 +/- 1.80378

llama_print_timings:        load time =   13751.66 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   10110.36 ms /   512 tokens (   19.75 ms per token,    50.64 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   11021.03 ms /   513 tokens

### Q4_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 901.443 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.09 seconds per pass - ETA 0.07 minutes
[1]5.4306,[2]7.4762,[3]7.9293,[4]9.0456,
Final estimate: PPL = 9.0456 +/- 1.80407

llama_print_timings:        load time =     184.21 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    4340.33 ms /   512 tokens (    8.48 ms per token,   117.96 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    5254.53 ms /   513 tokens

And below is a summary.

| SIMD | Type | PPL                | Total time [ms] |
|------|------|--------------------|----------------:|
| NEON | Q8_0 | 8.4178 +/- 1.61226 | 10796.42        |
| SVE  | Q8_0 | 8.4219 +/- 1.61261 | 4868.40         |
| NEON | Q4_0 | 9.0525 +/- 1.80378 | 11021.03        |
| SVE  | Q4_0 | 9.0456 +/- 1.80407 | 5254.53         |

This change does not appear to have any impact on accuracy.

ggerganov merged faa0e697 into master 1 year ago
ggerganov 1 year ago 👍 1

Thanks. I checked Azure Cloud to see if I can rent a node that supports Arm SVE, and it seems VMs will be available soon: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series?tabs=sizebasic
These VMs are currently in preview; when they become generally available, we can add ggml-ci for that instruction set.

JohannesGaessler 1 year ago 👍 2

I don't understand why, but after this PR I was having build issues on one of my machines when using make: the GPU could not be detected to determine the correct CUDA arch for -arch=native, even though there was no change to the Makefile. However, this seems to have been related to ccache, since compilation worked with LLAMA_NO_CCACHE; deleting ~/.cache/ccache has permanently fixed the issue for me.

msy-kato 1 year ago (edited) 👍 1

@ggerganov That's great, thank you for sharing the information. If there is anything I can do to help with CI/CD for the SVE implementation, I would be happy to contribute!
