[Performance] FP8 Grouped and Batched Matmuls #44231
simplify
1984e5da
finegrained fp8 moe forwards
b1fcbd80
optimized fp8 fused, batched and grouped paths
12b05465
Merge branch 'main' into fp8-grouped-mm
f47040fe
fix
84e9ef21
wrap triton
94e4cd79
fix calls
98475580
fix
2aa637b5
Merge branch 'main' into fp8-grouped-mm
57e47798
remove fused quant kernel (little gain and unnecessary) and use torch…
125d8f4e
use kernels
a2e7dd12
fix
71a1b8c2
no need to wrap cutlass
5c33299d
cleanup
9212cc37
fix
ffe79316
Merge branch 'main' into fp8-grouped-mm
3b9e9f6c
Merge branch 'main' into fp8-grouped-mm
fef6f359
added non gated experts support
25aedb2c
remove comments
7e7e2ac7
style
6c6e1768
fix
4ab554db
Update src/transformers/quantizers/quantizer_finegrained_fp8.py
8243a429
Update finegrained_fp8.py
77dde4e6
per tensor scaling support
3802cd43
SunMarc
approved these changes
on 2026-03-05
use custom fp8 interface
6fa940f0
document
eca2f01b
Merge branch 'main' into fp8-grouped-mm
c3107a90
SunMarc
approved these changes
on 2026-03-10
SunMarc
enabled auto-merge 98 days ago
Assignees
No one assigned