metal: SSM_SCAN performance #14743
feat: Add s_off as a parameter in the args struct
ba74a247
perf: Parallelize mamba2 SSM_SCAN metal kernel over d_state
8d5a25d3
gabe-l-hart changed the title from "metail: SSM_SCAN performance" to "metal: SSM_SCAN performance" 51 days ago
gabe-l-hart force pushed from 26524d08 to 8d5a25d3 51 days ago
fix: Update logic to correctly do the multi-layer parallel sum
e16e24be
fix: Correctly size the shared memory buffer and assert expected size …
21db0b59
gabe-l-hart force pushed from 0817add1 to 21db0b59 51 days ago
refactor: Compute block offsets once rather than once per token
a5334f91
feat: Use local variable for state recursion
3866f766
feat: Use a secondary simd_sum instead of a for loop
641276a8
feat: Add assertion and comment about relationship between simd size …
d06d0876
feat: Parallelize over d_state for mamba-1
80545ef5
feat: Parallel sum in SSM_CONV
16bc0596
Revert "feat: Parallel sum in SSM_CONV"
e55176a0
Merge remote-tracking branch 'origin/master' into GraniteFourPerf
f6d5e1ae
ggerganov approved these changes on 2025-07-25
Merge remote-tracking branch 'origin/master' into GraniteFourPerf
c3711e1d
refactor: Simplify shared memory sizing
d20b02d1
gabe-l-hart deleted the GraniteFourPerf branch 25 days ago