llama.cpp
Initial ET backend
#24179
Open
marty1885 wants to merge 299 commits into ggml-org:master from marty1885:et-backend-n-way-merge
ggml-et: Add MUL_MAT_ID kernel
f4f00f10
ggml-et: Build et kernels as part of ggml
7614dc85
ggml-et: Embed kernels with fs fallback
0100c79f
ggml-et: Build fixes
0a7a3cee
ggml-et: Add MUL_MAT F32xF32 op
801c5eab
ggml_et: Add MUL_MAT_ID op
c61c2fce
ggml-et: Disable offloading for debug
71d56529
ggml-et: Refactor out block ops
e74bf85c
ggml-et: ggml backend API changes
fa5494fc
ggml-et: Add RESHAPE/TRANSPOSE to supported
afcbd878
ggml-et: Add CONT_F16
e90637c6
ggml-et: Add supported ops doc
944804f6
gglm-et: Initial doc
51b23a39
ggml-et: Remove runtime import hacks
fd889d95
ggml-et: Fix GET_ROWS kernel
6dc9a8ca
ggml-et: Fix SET_ROWS kernel
2512acff
ggml-et: Use custom instruction for fp32->fp16
e9bb5311
ggml-et: Vectorize set_rows fp32->fp16
3bc8f5a0
ggml-et: Fix ROPE kernel (yarn)
b3bbea7d
ggml-et: Better sinf
b550d35b
ggml-et: Fix SOFT_MAX
8d91acab
ggml-et: Fix CONT
7ddf862c
ggml-et: Fix elmap kernel
22898d2c
ggml-et: Fix MUL_MAT MUL_MAT_ID remainders
4985ba59
ggml-et: Fix ET-SOC reference
b82b1024
ggml-et: Fix embed kernels scripts for old python
4f027eea
Merge pull request #3 from glguida/fix_old_python
756d5a17
Add sysemu support with compile time flag `-DGGML_ET_SYSEMU=ON` (#6)
f36eb795
build: proper dep tracking for kernels
c58ef523
support host using MOLD linker
87b4e717
initial multi core GET_ROW F32 implementation
9758af4d
vectorized q8 dequant
26b0627b
wip: cland warning clenaups and initial logging refactor
08eecf7b
wip: message default message cleanup
4420c99e
chore: message cleanups
3610ae62
cmake cleanup
127ba8d7
migrate to use platform provided functions
8832c224
cmake back into subdir
f339ffe0
support et_print() in kernels
8a721243
fix: repair kernel building
e836cdb9
perf: operations run async by default
5f6c5715
debug: proper kernel dep tracking and error detection on kenrel launch
bd8d500a
fix: kernel binary dep tracking and fixing get_rows_f32 erroring
e10f730c
perf: back to doing async kernel runs by default
4112bfff
perf: vectorize and parallel device memset
61dc924e
merge matmul work
cc0d09c7
merge upstream
a04500a1
misc: align allocation and enable all offload
9d1525d1
misc: delete deadcode and respect memory limits
728f9f9f
fix: repair tensor debug print
fe4d7cd0
fix: loosen RMS_NORM op percision
96816df6
feat: Q4_0 GET_ROWS
d76c3f8e
perf: FP32 MUL_MAT using TensorFMA
faf61dcc
update limitations
c0d7a1fe
perf: redue L1 load in compute_block_dot_product_q8_0
28cc52b2
feat: save kernel mapping (name to id) when profiling is enabled
a3482349
chore: memops cleanup
e553ffa9
perf: parallelize softmax by rows
7a2672fb
perf: vectorize 2nd phase of softmax
0682f059
Merge remote-tracking branch 'upstream/master' into backend-dev
c578ba9f
perf: ban GET_ROWS from offloaded
f1e177cc
perf: vectorize and non-atomic for eltwise ops and sub support
2a3fc321
perf: vectorize normal rope
4db6a1f2
perf: glu runs in parallel
fd6fa6ed
merge: manually merge saqib's work on kernel fixes
5b26a727
perf: more vectorized RoPE
42618f0d
perf: parallelize mul_mat_id
3d987840
perf: parallelize set_rows_f32
5b031145
perf: vectorize softmax
dd3ada94
feat: support kernel fusion and fuse RMS_NORM + MUL
48b86425
fix: mostly resolve test-backend-ops failure in SOFT_MAX and ROPE
3b4f0d20
fix: bump max rope dims for gemma
a42b1c81
feat: GeGLU and SCALE support to fully offload Gemma
46c54843
perf: faster device memset
12272a7d
feat: get_rows supporting Q4_K and avoid cont cache coherent issues
ed0f3a01
merge: merge upstream llama.cpp
07519b78
better F32 MM
86dfc6dd
feat: NORM for ET backend
cbc46348
feat: SQR for ET backend
b8f2f741
feat: UNARY on ET
2515f776
feat: el_map support broadcasting for ET
d8e6161d
feat: SUM_ROWS in ET backend
cf76d60f
feat: more ops in ET backend
2b28f90c
feat: WKV* operators in ET backend
42f866cf
perf: parallelize operators across cacheline instead of row
13fbf78a
perf: parallelize get_rows on cacheline
61fc7232
wip: baseline FlashAttention for ET backend
55952bb3
wip: enough FA and CPY f32->f16 to run llama 3.1 fully offloaded with…
27d228bb
feat: f16 x f16 -> f32 MM using matrix engine
806f4d67
wip: f16 FlashAttention using matrix engine
a7d7a786
wip: clean up
f9e92fb3
feat: barriers
fe5df20c
perf: optimize FA_F16 in ET
e1823591
perf: vectorize pack_k_for_transpose16
2562d612
perf: prefetch next loop matrix tile
aac1e7c6
perf: FlashAttention 2nd MM uses TensorFMA and optimizations
370c06dd
cleanup: flashattention reorg
582db50d
perf: optimizations and fixes
656f770d
feat: L2SCP API and make FlashAttention support DV = 256 for gemma
69b21924
perf: parallelize norms beyond single row
24670b8a
feat: GATED_DELTA_NET support and relaxed L2_NORM requirment
4db780ed
feat: loosen RMS_NORM, NORM, ROPE contingous req too
4ea34780
feat: repeat supports brocasting on dim 0 and loosen cont check
e2b8b12c
feat: FILL and DIAG operator
243b7bef
feat: loosen UNARY support chcek
23530ba0
feat: TRI support
043d91a8
feat: SOLVE_TRI support
22da6a1e
feat: basic SET support
04d62da4
feat: loosen CONT req
3fed43db
perf: fp16_to_fp32 use ASM
7524b049
feat: IMROPE support
28cbb11e
feat: PAD support
58f3e1e0
feat: global barrier
66237258
fix: view must live on the same backend as backing tensor
e378631e
feat: relax CONCAT in ET backend
cc7ac95f
feat: dead simple CUMSUM implementation
c522cd5a
feat: basic SSM_CONV support
d3bd261a
feat: loosen CONCAT req
29636c19
feat: relax GATED_DELTA_NET and add SET support proper
7a561b03
cleanup: cleanup LCM math
6f4aa8b0
feat: SWIGLU single input
24ab03c9
feat: SSM_SCAN support
fe05d582
feat: el_map supports non aligned tensors in best effort
913c266e
feat: basic GROUP_NORM support
5b93b8f1
feat: loosen MUL_MAT capablities slightly
d5cf7ad9
feat: loosen MUL_MAT and GET_ROWS and add IM2COL
faa2678e
feat: special case for softmax 1x1x1x1
40ed3563
feat: loosen SOFT_MAX req in ET backend
93cdc696
fix: el_map unaligned acse fixes
539444cd
perf: optimize zero_acc_vec in flash_attn_ext_f16_me
dcedd0d1
perf: use hart 1 for packing in MM and FA for FP16
d8621ad5
feat: kernel semaphore
0acc4535
perf: better instruction sequence in FlashAttention
81493cb4
fix: gated_delta_net with proper masking
73f63023
perf: better parallelization for GATED_DELTA_NET
865dd091
perf: parallelize SSM_CONV over nr
a8b13a45
perf: vectorize SSM_CONV
02e7e04f
perf: optimize MUL_MAT for q8
6b500e2a
Merge remote-tracking branch 'upstream/master' into backend-dev-2
c2d00ab1
feat: support Gemma 4
21eeeae8
fix: support multi-device
aeb17835
feat: broader GLU support
5c08f4a1
feat: unary ops supports view
12f2b2b9
fix: repair fp16 MM using matrix engine
d13ac818
perf: handle large N GEMV better
198f64f8
perf: better q8_0 MM
512b23f2
perf: better set_rows
06622b55
Merge remote-tracking branch 'upstream/master' into backend-dev-2
4effc2be
add back deleted files
83ee00bd
fix: repair after merge
42c81a0c
feat: POC version of uberkernel
143ca8b1
feat: RMS_NORM in uberkernel
35e2eece
feat: add more kernels into usage
3981a0d0
chore: clean up uberkernel compilation
02c3932d
perf: faster flash attention
782a26f5
perf: opt flash attention for large seq length
c9dd145e
feat: loosen op bounds. clamp and mean support
04a9345f
perf: vectorize ssm_scan
d0e20751
perf: slightly faster FA
9faa00c0
perf: FlashAttention parallel MM and load
05f76d0b
perf: fuse Q8 MM and ADD
cc76d2f1
feat: basic conv kernel for ET
481a28f4
softMAx_test
5b14deea
set_rows_f32
824f4d91
get_rows and cont
d88d5cbc
testing
03843058
set_rows_exp
363837d4
Junk addition
f0b83e26
Narrowing the issue
18f4c90a
Update flash_attn_ext_f16_me.c
51f4e4d7
test
b1ca3e55
Eviction updated
56d05448
Detailed cache eviction debug
54757483
Detailed cache eviction debug
3c3622fc
Merge branch 'cpyUberKerenl' of https://github.com/SaqibAkram-10xE/ll…
d1f49a4c
mulmat
ff31ba7b
removeal of `BUILD_FOR_UBERKERNEL` flag
ddc4078f
cleaning...
319fd2dc
feat: implement mul_mat and mul_mat_id for Q4_0 type
7fda43f9
fix: balance FCC0 count
1061ab9c
Merge remote-tracking branch 'rehan/feat/q4_0' into et-backend-n-way-…
ae0cdc86
merge upstream
85d9af78
optimize uberkernel plan upload
ae4f2d0e
Merge remote-tracking branch 'upstream/master' into et-backend-n-way-…
23376829
add mul_mat q4 into uberkernel
5c5b86ff
enable gating flush to just uberkernel
ecca77ee
update docs for ET
4b9cb8da
update op support for ET
7bfbbffa
Update uberkernel.c
e36beeff
Update unary_f32.c
5f7b8ee2
gemma 4
0aaf8a9d
bisect gemma4: enable scale_f32 only
e3a7bca9
bisect gemma4: +rms_norm_f32
b89b32c5
bisect gemma4: +rms_norm_mul_f32
7f849452
bisect gemma4: disable rms_norm_mul_f32 -- BREAKS OUTPUT
186e683b
bisect gemma4: +rope_f32 (skip rms_norm_mul)
32722bc4
bisect gemma4: +el_map_f32
ec652f1d
bisect gemma4: +softmax_f32
c49a0d3c
bisect gemma4: +get_rows_f32
4eb8306b
bisect gemma4: +glu_f32
195e7e2b
bisect gemma4: +mul_mat_f32 +mul_mat_f32_matrix_engine
86006f55
bisect gemma4: +mul_mat_f16 +mul_mat_f16_matrix_engine
34f5b6aa
bisect gemma4: +mul_mat_Q8_0 +mul_mat_Q4_0
eeb48b50
bisect gemma4: +flash_attn_ext_f32 +flash_attn_ext_f16_me
55751802
bisect gemma4: +mul_mat_id_f32
3f3e672d
bisect gemma4: +sum_rows_f32
e2dafbb4
bisect gemma4: +cont_f16
49aa7440
bisect gemma4: +fill_f32
74005e5a
bisect gemma4: +unary_f32 (all ops re-enabled except rms_norm_mul)
b517f925
Update rms_norm_mul_f32.c
9d8d0929
bisect2 gemma4 n64: +scale_f32 only
5940335d
bisect2 gemma4 n64: +rms_norm_f32 +rope_f32
05d55d47
bisect2 gemma4 n64: +rms_norm_mul_f32 (with ET_UBERKERNEL eviction fix)
3b77cbd2
bisect2 gemma4 n64: +el_map +get_rows +glu +softmax (skip rms_norm_mul)
9b88aa4c
bisect2 gemma4 n64: all ops enabled except rms_norm_mul
a13ca220
bisect2 n64: test unary+cont+fill+sum_rows (no mul_mat/flash_attn)
b8f520b0
bisect2 n64: +mul_mat_f32 +mul_mat_f32_matrix_engine
a4e99131
bisect2 n64: +mul_mat_f16 +mul_mat_f16_matrix_engine
d1421041
bisect2 n64: +mul_mat_Q8_0 +mul_mat_Q4_0
407638b3
bisect2 n64: +mul_mat_Q8_0 only (disable Q4_0)
3b45dc7f
bisect2 n64: +mul_mat_Q4_0 only (Q8_0 breaks)
a35a93b6
bisect2 n64: +mul_mat_id +flash_attn_ext (skip Q8_0)
73bcb3e4
run-3: matmul + rms_norm_mul
f9be6ac3
run-4
f16a85ed
Revert "run-4"
7c50be84
run5
4a8a767f
et-backend: optimize Q4_0 and Q8_0 mul_mat_id row accumulations
c9bf4a7a
et-backend: specialize mul_mat_id kernels for Q4_0 and Q8_0
3463064b
et-backend: fix RoPE YaRN corr_dim formula and handle degenerate inputs
9b2c8607
test-backend-ops: add DeepSeek-V2-Lite RoPE test coverage
23c60dfe
et-backend: add Q4_0 mul_mat matrix-engine kernel using TensorFMA32
d1e3442c
et-backend: vectorize Q4_0 matrix-engine dequantization
45c8051f
et-backend: support hybrid matrix/vector engine execution for Q4_0 mu…
6439ba09
et-backend: run partial-N tiles on matrix engine for Q4_0 mul_mat
f38e2958
et-backend: route Q4_0 mul_mat N < 53 to vecdot for better prefill la…
e8498fca
changes after cleanup
79e8ae83
Merge remote-tracking branch 'rehan/rehan/patch-v2' into et-backend-n…
7345ddf5
Merge remote-tracking branch 'saqib/UK_qwen' into et-backend-n-way-merge
7252f36a
merge upstream llama.cpp
ccf28702
cleanup before upstream
bef3b9cd
sync upstream
7e61d6fc
marty1885 requested a review from ggerganov 13 days ago
github-actions added documentation
github-actions added testing
github-actions added examples
github-actions added python
github-actions added server
github-actions added ggml
marty1885 force pushed from a50a88b0 to 7e61d6fc 13 days ago
remove ai agent residual and extra test files
95b51dca
taronaeo commented on 2026-06-05
taronaeo assigned taronaeo 13 days ago
restrict changes into ET backend
5843508b
restrict changes into ET backend
35e190ac
taronaeo commented on 2026-06-05
giladgd commented on 2026-06-06
move kernel embedding from Python to CMake
6faafd8e
move uberkernel gen into CMake
c15de475
apply clang format
fa42ee42
update CMake style
3f0c9979
update to match C and C++ style
e3460f3c
taronaeo commented on 2026-06-10
use source ggml and quant headers instead of ET's
02686956
taronaeo commented on 2026-06-15
Reviewers: taronaeo, giladgd, ggerganov
Assignees: taronaeo
Labels: documentation, testing, examples, python, server, ggml
Milestone: No milestone