[Performance] FP8 Grouped and Batched Matmuls (#44231)
* simplify
* finegrained fp8 moe forwards
* optimized fp8 fused, batched and grouped paths
* fix
* wrap triton
* fix calls
* fix
* remove fused quant kernel (little gain and unnecessary) and use torch library wrappers for better torch.compile compatibility
* use kernels
* fix
* no need to wrap cutlass
* cleanup
* fix
* add support for non-gated experts
* remove comments
* style
* fix
* Update src/transformers/quantizers/quantizer_finegrained_fp8.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update finegrained_fp8.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* per-tensor scaling support
* use custom fp8 interface
* document
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>