llama.cpp
Initial ET backend
#24179
Open

marty1885 wants to merge 299 commits into ggml-org:master from marty1885:et-backend-n-way-merge
vidas ggml-et: Add MUL_MAT_ID kernel
f4f00f10
vidas ggml-et: Build et kernels as part of ggml
7614dc85
vidas ggml-et: Embed kernels with fs fallback
0100c79f
vidas ggml-et: Build fixes
0a7a3cee
vidas ggml-et: Add MUL_MAT F32xF32 op
801c5eab
vidas ggml_et: Add MUL_MAT_ID op
c61c2fce
vidas ggml-et: Disable offloading for debug
71d56529
vidas ggml-et: Refactor out block ops
e74bf85c
vidas ggml-et: ggml backend API changes
fa5494fc
vidas ggml-et: Add RESHAPE/TRANSPOSE to supported
afcbd878
vidas ggml-et: Add CONT_F16
e90637c6
vidas ggml-et: Add supported ops doc
944804f6
vidas gglm-et: Initial doc
51b23a39
glguida ggml-et: Remove runtime import hacks
fd889d95
vidas ggml-et: Fix GET_ROWS kernel
6dc9a8ca
vidas ggml-et: Fix SET_ROWS kernel
2512acff
vidas ggml-et: Use custom instruction for fp32->fp16
e9bb5311
vidas ggml-et: Vectorize set_rows fp32->fp16
3bc8f5a0
vidas ggml-et: Fix ROPE kernel (yarn)
b3bbea7d
vidas ggml-et: Better sinf
b550d35b
vidas ggml-et: Fix SOFT_MAX
8d91acab
vidas ggml-et: Fix CONT
7ddf862c
vidas ggml-et: Fix elmap kernel
22898d2c
vidas ggml-et: Fix MUL_MAT MUL_MAT_ID remainders
4985ba59
vidas ggml-et: Fix ET-SOC reference
b82b1024
z8team ggml-et: Fix embed kernels scripts for old python
4f027eea
glguida Merge pull request #3 from glguida/fix_old_python
756d5a17
ubergarm Add sysemu support with compile time flag `-DGGML_ET_SYSEMU=ON` (#6)
f36eb795
marty1885 build: proper dep tracking for kernels
c58ef523
marty1885 support host using MOLD linker
87b4e717
marty1885 initial multi core GET_ROW F32 implementation
9758af4d
marty1885 vectorized q8 dequant
26b0627b
marty1885 wip: cland warning clenaups and initial logging refactor
08eecf7b
marty1885 wip: message default message cleanup
4420c99e
marty1885 chore: message cleanups
3610ae62
marty1885 cmake cleanup
127ba8d7
marty1885 migrate to use platform provided functions
8832c224
marty1885 cmake back into subdir
f339ffe0
marty1885 support et_print() in kernels
8a721243
marty1885 fix: repair kernel building
e836cdb9
marty1885 perf: operations run async by default
5f6c5715
marty1885 debug: proper kernel dep tracking and error detection on kenrel launch
bd8d500a
marty1885 fix: kernel binary dep tracking and fixing get_rows_f32 erroring
e10f730c
marty1885 perf: back to doing async kernel runs by default
4112bfff
marty1885 perf: vectorize and parallel device memset
61dc924e
marty1885 merge matmul work
cc0d09c7
marty1885 merge upstream
a04500a1
marty1885 misc: align allocation and enable all offload
9d1525d1
marty1885 misc: delete deadcode and respect memory limits
728f9f9f
marty1885 fix: repair tensor debug print
fe4d7cd0
marty1885 fix: loosen RMS_NORM op percision
96816df6
marty1885 feat: Q4_0 GET_ROWS
d76c3f8e
marty1885 perf: FP32 MUL_MAT using TensorFMA
faf61dcc
marty1885 update limitations
c0d7a1fe
marty1885 perf: redue L1 load in compute_block_dot_product_q8_0
28cc52b2
marty1885 feat: save kernel mapping (name to id) when profiling is enabled
a3482349
marty1885 chore: memops cleanup
e553ffa9
marty1885 perf: parallelize softmax by rows
7a2672fb
marty1885 perf: vectorize 2nd phase of softmax
0682f059
marty1885 Merge remote-tracking branch 'upstream/master' into backend-dev
c578ba9f
marty1885 perf: ban GET_ROWS from offloaded
f1e177cc
marty1885 perf: vectorize and non-atomic for eltwise ops and sub support
2a3fc321
marty1885 perf: vectorize normal rope
4db6a1f2
marty1885 perf: glu runs in parallel
fd6fa6ed
marty1885 merge: manually merge saqib's work on kernel fixes
5b26a727
marty1885 perf: more vectorized RoPE
42618f0d
marty1885 perf: parallelize mul_mat_id
3d987840
marty1885 perf: parallelize set_rows_f32
5b031145
marty1885 perf: vectorize softmax
dd3ada94
marty1885 feat: support kernel fusion and fuse RMS_NORM + MUL
48b86425
marty1885 fix: mostly resolve test-backend-ops failure in SOFT_MAX and ROPE
3b4f0d20
marty1885 fix: bump max rope dims for gemma
a42b1c81
marty1885 feat: GeGLU and SCALE support to fully offload Gemma
46c54843
marty1885 perf: faster device memset
12272a7d
marty1885 feat: get_rows supporting Q4_K and avoid cont cache coherent issues
ed0f3a01
marty1885 merge: merge upstream llama.cpp
07519b78
marty1885 better F32 MM
86dfc6dd
marty1885 feat: NORM for ET backend
cbc46348
marty1885 feat: SQR for ET backend
b8f2f741
marty1885 feat: UNARY on ET
2515f776
marty1885 feat: el_map support broadcasting for ET
d8e6161d
marty1885 feat: SUM_ROWS in ET backend
cf76d60f
marty1885 feat: more ops in ET backend
2b28f90c
marty1885 feat: WKV* operators in ET backend
42f866cf
marty1885 perf: parallelize operators across cacheline instead of row
13fbf78a
marty1885 perf: parallelize get_rows on cacheline
61fc7232
marty1885 wip: baseline FlashAttention for ET backend
55952bb3
marty1885 wip: enough FA and CPY f32->f16 to run llama 3.1 fully offloaded with…
27d228bb
marty1885 feat: f16 x f16 -> f32 MM using matrix engine
806f4d67
marty1885 wip: f16 FlashAttention using matrix engine
a7d7a786
marty1885 wip: clean up
f9e92fb3
marty1885 feat: barriers
fe5df20c
marty1885 perf: optimize FA_F16 in ET
e1823591
marty1885 perf: vectorize pack_k_for_transpose16
2562d612
marty1885 perf: prefetch next loop matrix tile
aac1e7c6
marty1885 perf: FlashAttention 2nd MM uses TensorFMA and optimizations
370c06dd
marty1885 cleanup: flashattention reorg
582db50d
marty1885 perf: optimizations and fixes
656f770d
marty1885 feat: L2SCP API and make FlashAttention support DV = 256 for gemma
69b21924
marty1885 perf: parallelize norms beyond single row
24670b8a
marty1885 feat: GATED_DELTA_NET support and relaxed L2_NORM requirment
4db780ed
marty1885 feat: loosen RMS_NORM, NORM, ROPE contingous req too
4ea34780
marty1885 feat: repeat supports brocasting on dim 0 and loosen cont check
e2b8b12c
marty1885 feat: FILL and DIAG operator
243b7bef
marty1885 feat: loosen UNARY support chcek
23530ba0
marty1885 feat: TRI support
043d91a8
marty1885 feat: SOLVE_TRI support
22da6a1e
marty1885 feat: basic SET support
04d62da4
marty1885 feat: loosen CONT req
3fed43db
marty1885 perf: fp16_to_fp32 use ASM
7524b049
marty1885 feat: IMROPE support
28cbb11e
marty1885 feat: PAD support
58f3e1e0
marty1885 feat: global barrier
66237258
marty1885 fix: view must live on the same backend as backing tensor
e378631e
marty1885 feat: relax CONCAT in ET backend
cc7ac95f
marty1885 feat: dead simple CUMSUM implementation
c522cd5a
marty1885 feat: basic SSM_CONV support
d3bd261a
marty1885 feat: loosen CONCAT req
29636c19
marty1885 feat: relax GATED_DELTA_NET and add SET support proper
7a561b03
marty1885 cleanup: cleanup LCM math
6f4aa8b0
marty1885 feat: SWIGLU single input
24ab03c9
marty1885 feat: SSM_SCAN support
fe05d582
marty1885 feat: el_map supports non aligned tensors in best effort
913c266e
marty1885 feat: basic GROUP_NORM support
5b93b8f1
marty1885 feat: loosen MUL_MAT capablities slightly
d5cf7ad9
marty1885 feat: loosen MUL_MAT and GET_ROWS and add IM2COL
faa2678e
marty1885 feat: special case for softmax 1x1x1x1
40ed3563
marty1885 feat: loosen SOFT_MAX req in ET backend
93cdc696
marty1885 fix: el_map unaligned acse fixes
539444cd
marty1885 perf: optimize zero_acc_vec in flash_attn_ext_f16_me
dcedd0d1
marty1885 perf: use hart 1 for packing in MM and FA for FP16
d8621ad5
marty1885 feat: kernel semaphore
0acc4535
marty1885 perf: better instruction sequence in FlashAttention
81493cb4
marty1885 fix: gated_delta_net with proper masking
73f63023
marty1885 perf: better parallelization for GATED_DELTA_NET
865dd091
marty1885 perf: parallelize SSM_CONV over nr
a8b13a45
marty1885 perf: vectorize SSM_CONV
02e7e04f
marty1885 perf: optimize MUL_MAT for q8
6b500e2a
marty1885 Merge remote-tracking branch 'upstream/master' into backend-dev-2
c2d00ab1
marty1885 feat: support Gemma 4
21eeeae8
marty1885 fix: support multi-device
aeb17835
marty1885 feat: broader GLU support
5c08f4a1
marty1885 feat: unary ops supports view
12f2b2b9
marty1885 fix: repair fp16 MM using matrix engine
d13ac818
marty1885 perf: handle large N GEMV better
198f64f8
marty1885 perf: better q8_0 MM
512b23f2
marty1885 perf: better set_rows
06622b55
marty1885 Merge remote-tracking branch 'upstream/master' into backend-dev-2
4effc2be
marty1885 add back deleted files
83ee00bd
marty1885 fix: repair after merge
42c81a0c
marty1885 feat: POC version of uberkernel
143ca8b1
marty1885 feat: RMS_NORM in uberkernel
35e2eece
marty1885 feat: add more kernels into usage
3981a0d0
marty1885 chore: clean up uberkernel compilation
02c3932d
marty1885 perf: faster flash attention
782a26f5
marty1885 perf: opt flash attention for large seq length
c9dd145e
marty1885 feat: loosen op bounds. clamp and mean support
04a9345f
marty1885 perf: vectorize ssm_scan
d0e20751
marty1885 perf: slightly faster FA
9faa00c0
marty1885 perf: FlashAttention parallel MM and load
05f76d0b
marty1885 perf: fuse Q8 MM and ADD
cc76d2f1
marty1885 feat: basic conv kernel for ET
481a28f4
SaqibAkram-10xE softMAx_test
5b14deea
SaqibAkram-10xE set_rows_f32
824f4d91
SaqibAkram-10xE get_rows and cont
d88d5cbc
SaqibAkram-10xE testing
03843058
SaqibAkram-10xE set_rows_exp
363837d4
SaqibAkram-10xE Junk addition
f0b83e26
SaqibAkram-10xE Narrowing the issue
18f4c90a
SaqibAkram-10xE Update flash_attn_ext_f16_me.c
51f4e4d7
SaqibAkram-10xE test
b1ca3e55
SaqibAkram-10xE Eviction updated
56d05448
SaqibAkram-10xE Detailed cache eviction debug
54757483
SaqibAkram-10xE Detailed cache eviction debug
3c3622fc
SaqibAkram-10xE Merge branch 'cpyUberKerenl' of https://github.com/SaqibAkram-10xE/ll…
d1f49a4c
SaqibAkram-10xE mulmat
ff31ba7b
SaqibAkram-10xE removeal of `BUILD_FOR_UBERKERNEL` flag
ddc4078f
SaqibAkram-10xE cleaning...
319fd2dc
RehanQasim-dev feat: implement mul_mat and mul_mat_id for Q4_0 type
7fda43f9
marty1885 fix: balance FCC0 count
1061ab9c
marty1885 Merge remote-tracking branch 'rehan/feat/q4_0' into et-backend-n-way-…
ae0cdc86
marty1885 merge upstream
85d9af78
marty1885 optimize uberkernel plan upload
ae4f2d0e
marty1885 Merge remote-tracking branch 'upstream/master' into et-backend-n-way-…
23376829
marty1885 add mul_mat q4 into uberkernel
5c5b86ff
marty1885 enable gating flush to just uberkernel
ecca77ee
marty1885 update docs for ET
4b9cb8da
marty1885 update op support for ET
7bfbbffa
SaqibAkram-10xE Update uberkernel.c
e36beeff
SaqibAkram-10xE Update unary_f32.c
5f7b8ee2
SaqibAkram-10xE gemma 4
0aaf8a9d
SaqibAkram-10xE bisect gemma4: enable scale_f32 only
e3a7bca9
SaqibAkram-10xE bisect gemma4: +rms_norm_f32
b89b32c5
SaqibAkram-10xE bisect gemma4: +rms_norm_mul_f32
7f849452
SaqibAkram-10xE bisect gemma4: disable rms_norm_mul_f32 -- BREAKS OUTPUT
186e683b
SaqibAkram-10xE bisect gemma4: +rope_f32 (skip rms_norm_mul)
32722bc4
SaqibAkram-10xE bisect gemma4: +el_map_f32
ec652f1d
SaqibAkram-10xE bisect gemma4: +softmax_f32
c49a0d3c
SaqibAkram-10xE bisect gemma4: +get_rows_f32
4eb8306b
SaqibAkram-10xE bisect gemma4: +glu_f32
195e7e2b
SaqibAkram-10xE bisect gemma4: +mul_mat_f32 +mul_mat_f32_matrix_engine
86006f55
SaqibAkram-10xE bisect gemma4: +mul_mat_f16 +mul_mat_f16_matrix_engine
34f5b6aa
SaqibAkram-10xE bisect gemma4: +mul_mat_Q8_0 +mul_mat_Q4_0
eeb48b50
SaqibAkram-10xE bisect gemma4: +flash_attn_ext_f32 +flash_attn_ext_f16_me
55751802
SaqibAkram-10xE bisect gemma4: +mul_mat_id_f32
3f3e672d
SaqibAkram-10xE bisect gemma4: +sum_rows_f32
e2dafbb4
SaqibAkram-10xE bisect gemma4: +cont_f16
49aa7440
SaqibAkram-10xE bisect gemma4: +fill_f32
74005e5a
SaqibAkram-10xE bisect gemma4: +unary_f32 (all ops re-enabled except rms_norm_mul)
b517f925
SaqibAkram-10xE Update rms_norm_mul_f32.c
9d8d0929
SaqibAkram-10xE bisect2 gemma4 n64: +scale_f32 only
5940335d
SaqibAkram-10xE bisect2 gemma4 n64: +rms_norm_f32 +rope_f32
05d55d47
SaqibAkram-10xE bisect2 gemma4 n64: +rms_norm_mul_f32 (with ET_UBERKERNEL eviction fix)
3b77cbd2
SaqibAkram-10xE bisect2 gemma4 n64: +el_map +get_rows +glu +softmax (skip rms_norm_mul)
9b88aa4c
SaqibAkram-10xE bisect2 gemma4 n64: all ops enabled except rms_norm_mul
a13ca220
SaqibAkram-10xE bisect2 n64: test unary+cont+fill+sum_rows (no mul_mat/flash_attn)
b8f520b0
SaqibAkram-10xE bisect2 n64: +mul_mat_f32 +mul_mat_f32_matrix_engine
a4e99131
SaqibAkram-10xE bisect2 n64: +mul_mat_f16 +mul_mat_f16_matrix_engine
d1421041
SaqibAkram-10xE bisect2 n64: +mul_mat_Q8_0 +mul_mat_Q4_0
407638b3
SaqibAkram-10xE bisect2 n64: +mul_mat_Q8_0 only (disable Q4_0)
3b45dc7f
SaqibAkram-10xE bisect2 n64: +mul_mat_Q4_0 only (Q8_0 breaks)
a35a93b6
SaqibAkram-10xE bisect2 n64: +mul_mat_id +flash_attn_ext (skip Q8_0)
73bcb3e4
SaqibAkram-10xE run-3: matmul + rms_norm_mul
f9be6ac3
SaqibAkram-10xE run-4
f16a85ed
SaqibAkram-10xE Revert "run-4"
7c50be84
SaqibAkram-10xE run5
4a8a767f
RehanQasim-dev et-backend: optimize Q4_0 and Q8_0 mul_mat_id row accumulations
c9bf4a7a
RehanQasim-dev et-backend: specialize mul_mat_id kernels for Q4_0 and Q8_0
3463064b
RehanQasim-dev et-backend: fix RoPE YaRN corr_dim formula and handle degenerate inputs
9b2c8607
RehanQasim-dev test-backend-ops: add DeepSeek-V2-Lite RoPE test coverage
23c60dfe
RehanQasim-dev et-backend: add Q4_0 mul_mat matrix-engine kernel using TensorFMA32
d1e3442c
RehanQasim-dev et-backend: vectorize Q4_0 matrix-engine dequantization
45c8051f
RehanQasim-dev et-backend: support hybrid matrix/vector engine execution for Q4_0 mu…
6439ba09
RehanQasim-dev et-backend: run partial-N tiles on matrix engine for Q4_0 mul_mat
f38e2958
RehanQasim-dev et-backend: route Q4_0 mul_mat N < 53 to vecdot for better prefill la…
e8498fca
SaqibAkram-10xE changes after cleanup
79e8ae83
marty1885 Merge remote-tracking branch 'rehan/rehan/patch-v2' into et-backend-n…
7345ddf5
marty1885 Merge remote-tracking branch 'saqib/UK_qwen' into et-backend-n-way-merge
7252f36a
marty1885 merge upstream llama.cpp
ccf28702
marty1885 cleanup before upstream
bef3b9cd
marty1885 sync upstream
7e61d6fc
marty1885 requested a review from ggerganov 13 days ago
github-actions added documentation
github-actions added testing
github-actions added examples
github-actions added python
github-actions added server
github-actions added ggml
marty1885 force pushed from a50a88b0 to 7e61d6fc 13 days ago
marty1885 remove ai agent residual and extra test files
95b51dca
taronaeo commented on 2026-06-05
taronaeo assigned taronaeo 13 days ago
marty1885 restrict changes into ET backend
5843508b
marty1885 restrict changes into ET backend
35e190ac
taronaeo commented on 2026-06-05
ubergarm
giladgd commented on 2026-06-06
marty1885 move kernel embedding from Python to CMake
6faafd8e
marty1885 move uberkernel gen into CMake
c15de475
marty1885 apply clang format
fa42ee42
marty1885 update CMake style
3f0c9979
marty1885 update to match C and C++ style
e3460f3c
taronaeo commented on 2026-06-10
marty1885 use source ggml and quant headers instead of ET's
02686956
taronaeo commented on 2026-06-15
