PR #24179 Initial ET backend

ggml-et: Add MUL_MAT_ID kernel

f4f00f10

ggml-et: Build et kernels as part of ggml

7614dc85

ggml-et: Embed kernels with fs fallback

0100c79f

ggml-et: Build fixes

0a7a3cee

ggml-et: Add MUL_MAT F32xF32 op

801c5eab

ggml-et: Add MUL_MAT_ID op

c61c2fce

ggml-et: Disable offloading for debug

71d56529

ggml-et: Refactor out block ops

e74bf85c

ggml-et: ggml backend API changes

fa5494fc

ggml-et: Add RESHAPE/TRANSPOSE to supported

afcbd878

ggml-et: Add CONT_F16

e90637c6

ggml-et: Add supported ops doc

944804f6

ggml-et: Initial doc

51b23a39

ggml-et: Remove runtime import hacks

fd889d95

ggml-et: Fix GET_ROWS kernel

6dc9a8ca

ggml-et: Fix SET_ROWS kernel

2512acff

ggml-et: Use custom instruction for fp32->fp16

e9bb5311

ggml-et: Vectorize set_rows fp32->fp16

3bc8f5a0

ggml-et: Fix ROPE kernel (yarn)

b3bbea7d

ggml-et: Better sinf

b550d35b

ggml-et: Fix SOFT_MAX

8d91acab

ggml-et: Fix CONT

7ddf862c

ggml-et: Fix elmap kernel

22898d2c

ggml-et: Fix MUL_MAT MUL_MAT_ID remainders

4985ba59

ggml-et: Fix ET-SOC reference

b82b1024

ggml-et: Fix embed kernels scripts for old python

4f027eea

Merge pull request #3 from glguida/fix_old_python

756d5a17

Add sysemu support with compile time flag `-DGGML_ET_SYSEMU=ON` (#6)

f36eb795

build: proper dep tracking for kernels

c58ef523

support host using MOLD linker

87b4e717

initial multi core GET_ROW F32 implementation

9758af4d

vectorized q8 dequant

26b0627b

wip: clang warning cleanups and initial logging refactor

08eecf7b

wip: message default message cleanup

4420c99e

chore: message cleanups

3610ae62

cmake cleanup

127ba8d7

migrate to use platform provided functions

8832c224

cmake back into subdir

f339ffe0

support et_print() in kernels

8a721243

fix: repair kernel building

e836cdb9

perf: operations run async by default

5f6c5715

debug: proper kernel dep tracking and error detection on kernel launch

bd8d500a

fix: kernel binary dep tracking and fixing get_rows_f32 erroring

e10f730c

perf: back to doing async kernel runs by default

4112bfff

perf: vectorize and parallel device memset

61dc924e

merge matmul work

cc0d09c7

merge upstream

a04500a1

misc: align allocation and enable all offload

9d1525d1

misc: delete deadcode and respect memory limits

728f9f9f

fix: repair tensor debug print

fe4d7cd0

fix: loosen RMS_NORM op precision

96816df6

feat: Q4_0 GET_ROWS

d76c3f8e

perf: FP32 MUL_MAT using TensorFMA

faf61dcc

update limitations

c0d7a1fe

perf: reduce L1 load in compute_block_dot_product_q8_0

28cc52b2

feat: save kernel mapping (name to id) when profiling is enabled

a3482349

chore: memops cleanup

e553ffa9

perf: parallelize softmax by rows

7a2672fb

perf: vectorize 2nd phase of softmax

0682f059

Merge remote-tracking branch 'upstream/master' into backend-dev

c578ba9f

perf: ban GET_ROWS from offloaded

f1e177cc

perf: vectorize and non-atomic for eltwise ops and sub support

2a3fc321

perf: vectorize normal rope

4db6a1f2

perf: glu runs in parallel

fd6fa6ed

merge: manually merge saqib's work on kernel fixes

5b26a727

perf: more vectorized RoPE

42618f0d

perf: parallelize mul_mat_id

3d987840

perf: parallelize set_rows_f32

5b031145

perf: vectorize softmax

dd3ada94

feat: support kernel fusion and fuse RMS_NORM + MUL

48b86425

fix: mostly resolve test-backend-ops failure in SOFT_MAX and ROPE

3b4f0d20

fix: bump max rope dims for gemma

a42b1c81

feat: GeGLU and SCALE support to fully offload Gemma

46c54843

perf: faster device memset

12272a7d

feat: get_rows supporting Q4_K and avoid cont cache coherent issues

ed0f3a01

merge: merge upstream llama.cpp

07519b78

better F32 MM

86dfc6dd

feat: NORM for ET backend

cbc46348

feat: SQR for ET backend

b8f2f741

feat: UNARY on ET

2515f776

feat: el_map support broadcasting for ET

d8e6161d

feat: SUM_ROWS in ET backend

cf76d60f

feat: more ops in ET backend

2b28f90c

feat: WKV* operators in ET backend

42f866cf

perf: parallelize operators across cacheline instead of row

13fbf78a

perf: parallelize get_rows on cacheline

61fc7232

wip: baseline FlashAttention for ET backend

55952bb3

wip: enough FA and CPY f32->f16 to run llama 3.1 fully offloaded with…

27d228bb

feat: f16 x f16 -> f32 MM using matrix engine

806f4d67

wip: f16 FlashAttention using matrix engine

a7d7a786

wip: clean up

f9e92fb3

feat: barriers

fe5df20c

perf: optimize FA_F16 in ET

e1823591

perf: vectorize pack_k_for_transpose16

2562d612

perf: prefetch next loop matrix tile

aac1e7c6

perf: FlashAttention 2nd MM uses TensorFMA and optimizations

370c06dd

cleanup: flashattention reorg

582db50d

perf: optimizations and fixes

656f770d

feat: L2SCP API and make FlashAttention support DV = 256 for gemma

69b21924

perf: parallelize norms beyond single row

24670b8a

feat: GATED_DELTA_NET support and relaxed L2_NORM requirement

4db780ed

feat: loosen RMS_NORM, NORM, ROPE contiguous req too

4ea34780

feat: repeat supports broadcasting on dim 0 and loosen cont check

e2b8b12c

feat: FILL and DIAG operator

243b7bef

feat: loosen UNARY support check

23530ba0

feat: TRI support

043d91a8

feat: SOLVE_TRI support

22da6a1e

feat: basic SET support

04d62da4

feat: loosen CONT req

3fed43db

perf: fp16_to_fp32 use ASM

7524b049

feat: IMROPE support

28cbb11e

feat: PAD support

58f3e1e0

feat: global barrier

66237258

fix: view must live on the same backend as backing tensor

e378631e

feat: relax CONCAT in ET backend

cc7ac95f

feat: dead simple CUMSUM implementation

c522cd5a

feat: basic SSM_CONV support

d3bd261a

feat: loosen CONCAT req

29636c19

feat: relax GATED_DELTA_NET and add SET support proper

7a561b03

cleanup: cleanup LCM math

6f4aa8b0

feat: SWIGLU single input

24ab03c9

feat: SSM_SCAN support

fe05d582

feat: el_map supports non aligned tensors in best effort

913c266e

feat: basic GROUP_NORM support

5b93b8f1

feat: loosen MUL_MAT capablities slightly

d5cf7ad9

feat: loosen MUL_MAT and GET_ROWS and add IM2COL

faa2678e

feat: special case for softmax 1x1x1x1

40ed3563

feat: loosen SOFT_MAX req in ET backend

93cdc696

fix: el_map unaligned case fixes

539444cd

perf: optimize zero_acc_vec in flash_attn_ext_f16_me

dcedd0d1

perf: use hart 1 for packing in MM and FA for FP16

d8621ad5

feat: kernel semaphore

0acc4535

perf: better instruction sequence in FlashAttention

81493cb4

fix: gated_delta_net with proper masking

73f63023

perf: better parallelization for GATED_DELTA_NET

865dd091

perf: parallelize SSM_CONV over nr

a8b13a45

perf: vectorize SSM_CONV

02e7e04f

perf: optimize MUL_MAT for q8

6b500e2a

Merge remote-tracking branch 'upstream/master' into backend-dev-2

c2d00ab1

feat: support Gemma 4

21eeeae8

fix: support multi-device

aeb17835

feat: broader GLU support

5c08f4a1

feat: unary ops supports view

12f2b2b9

fix: repair fp16 MM using matrix engine

d13ac818

perf: handle large N GEMV better

198f64f8

perf: better q8_0 MM

512b23f2

perf: better set_rows

06622b55

Merge remote-tracking branch 'upstream/master' into backend-dev-2

4effc2be

add back deleted files

83ee00bd

fix: repair after merge

42c81a0c

feat: POC version of uberkernel

143ca8b1

feat: RMS_NORM in uberkernel

35e2eece

feat: add more kernels into usage

3981a0d0

chore: clean up uberkernel compilation

02c3932d

perf: faster flash attention

782a26f5

perf: opt flash attention for large seq length

c9dd145e

feat: loosen op bounds. clamp and mean support

04a9345f

perf: vectorize ssm_scan

d0e20751

perf: slightly faster FA

9faa00c0

perf: FlashAttention parallel MM and load

05f76d0b

perf: fuse Q8 MM and ADD

cc76d2f1

feat: basic conv kernel for ET

481a28f4

softMAx_test

5b14deea

set_rows_f32

824f4d91

get_rows and cont

d88d5cbc

testing

03843058

set_rows_exp

363837d4

Junk addition

f0b83e26

Narrowing the issue

18f4c90a

Update flash_attn_ext_f16_me.c

51f4e4d7

test

b1ca3e55

Eviction updated

56d05448

Detailed cache eviction debug

54757483

Detailed cache eviction debug

3c3622fc

Merge branch 'cpyUberKerenl' of https://github.com/SaqibAkram-10xE/ll…

d1f49a4c

mulmat

ff31ba7b

removal of `BUILD_FOR_UBERKERNEL` flag

ddc4078f

cleaning...

319fd2dc

feat: implement mul_mat and mul_mat_id for Q4_0 type

7fda43f9

fix: balance FCC0 count

1061ab9c

Merge remote-tracking branch 'rehan/feat/q4_0' into et-backend-n-way-…

ae0cdc86

merge upstream

85d9af78

optimize uberkernel plan upload

ae4f2d0e

Merge remote-tracking branch 'upstream/master' into et-backend-n-way-…

23376829

add mul_mat q4 into uberkernel

5c5b86ff

enable gating flush to just uberkernel

ecca77ee

update docs for ET

4b9cb8da

update op support for ET

7bfbbffa

Update uberkernel.c

e36beeff

Update unary_f32.c

5f7b8ee2

gemma 4

0aaf8a9d

bisect gemma4: enable scale_f32 only

e3a7bca9

bisect gemma4: +rms_norm_f32

b89b32c5

bisect gemma4: +rms_norm_mul_f32

7f849452

bisect gemma4: disable rms_norm_mul_f32 -- BREAKS OUTPUT

186e683b

bisect gemma4: +rope_f32 (skip rms_norm_mul)

32722bc4

bisect gemma4: +el_map_f32

ec652f1d

bisect gemma4: +softmax_f32

c49a0d3c

bisect gemma4: +get_rows_f32

4eb8306b

bisect gemma4: +glu_f32

195e7e2b

bisect gemma4: +mul_mat_f32 +mul_mat_f32_matrix_engine

86006f55

bisect gemma4: +mul_mat_f16 +mul_mat_f16_matrix_engine

34f5b6aa

bisect gemma4: +mul_mat_Q8_0 +mul_mat_Q4_0

eeb48b50

bisect gemma4: +flash_attn_ext_f32 +flash_attn_ext_f16_me

55751802

bisect gemma4: +mul_mat_id_f32

3f3e672d

bisect gemma4: +sum_rows_f32

e2dafbb4

bisect gemma4: +cont_f16

49aa7440

bisect gemma4: +fill_f32

74005e5a

bisect gemma4: +unary_f32 (all ops re-enabled except rms_norm_mul)

b517f925

Update rms_norm_mul_f32.c

9d8d0929

bisect2 gemma4 n64: +scale_f32 only

5940335d

bisect2 gemma4 n64: +rms_norm_f32 +rope_f32

05d55d47

bisect2 gemma4 n64: +rms_norm_mul_f32 (with ET_UBERKERNEL eviction fix)

3b77cbd2

bisect2 gemma4 n64: +el_map +get_rows +glu +softmax (skip rms_norm_mul)

9b88aa4c

bisect2 gemma4 n64: all ops enabled except rms_norm_mul

a13ca220

bisect2 n64: test unary+cont+fill+sum_rows (no mul_mat/flash_attn)

b8f520b0

bisect2 n64: +mul_mat_f32 +mul_mat_f32_matrix_engine

a4e99131

bisect2 n64: +mul_mat_f16 +mul_mat_f16_matrix_engine

d1421041

bisect2 n64: +mul_mat_Q8_0 +mul_mat_Q4_0

407638b3

bisect2 n64: +mul_mat_Q8_0 only (disable Q4_0)

3b45dc7f

bisect2 n64: +mul_mat_Q4_0 only (Q8_0 breaks)

a35a93b6

bisect2 n64: +mul_mat_id +flash_attn_ext (skip Q8_0)

73bcb3e4

run-3: matmul + rms_norm_mul

f9be6ac3

run-4

f16a85ed

Revert "run-4"

7c50be84

run5

4a8a767f

et-backend: optimize Q4_0 and Q8_0 mul_mat_id row accumulations

c9bf4a7a

et-backend: specialize mul_mat_id kernels for Q4_0 and Q8_0

3463064b

et-backend: fix RoPE YaRN corr_dim formula and handle degenerate inputs

9b2c8607

test-backend-ops: add DeepSeek-V2-Lite RoPE test coverage

23c60dfe

et-backend: add Q4_0 mul_mat matrix-engine kernel using TensorFMA32

d1e3442c

et-backend: vectorize Q4_0 matrix-engine dequantization

45c8051f

et-backend: support hybrid matrix/vector engine execution for Q4_0 mu…

6439ba09

et-backend: run partial-N tiles on matrix engine for Q4_0 mul_mat

f38e2958

et-backend: route Q4_0 mul_mat N < 53 to vecdot for better prefill la…

e8498fca

changes after cleanup

79e8ae83

Merge remote-tracking branch 'rehan/rehan/patch-v2' into et-backend-n…

7345ddf5

Merge remote-tracking branch 'saqib/UK_qwen' into et-backend-n-way-merge

7252f36a

merge upstream llama.cpp

ccf28702

cleanup before upstream

bef3b9cd

sync upstream

7e61d6fc

marty1885 requested a review from ggerganov 13 days ago

github-actions added documentation

github-actions added testing

github-actions added examples

github-actions added python

github-actions added server

github-actions added ggml

marty1885 force pushed from a50a88b0 to 7e61d6fc 13 days ago

remove ai agent residual and extra test files

95b51dca

taronaeo commented on 2026-06-05

taronaeo assigned taronaeo 13 days ago

restrict changes into ET backend

5843508b

restrict changes into ET backend

35e190ac

taronaeo commented on 2026-06-05

giladgd commented on 2026-06-06

move kernel embedding from Python to CMake

6faafd8e

move uberkernel gen into CMake

c15de475

apply clang format

fa42ee42

update CMake style

3f0c9979

update to match C and C++ style

e3460f3c

taronaeo commented on 2026-06-10

use source ggml and quant headers instead of ET's

02686956

taronaeo commented on 2026-06-15

llama.cpp
Initial ET backend
#24179

Open