vllm
5284a65b - Refactor ModelOptFp8MoEMethod to use modular kernels

Commit
131 days ago
Refactor ModelOptFp8MoEMethod to use modular kernels

Apply the modular kernel refactoring pattern from PR #30825 to ModelOptFp8MoEMethod. This enables cleaner kernel dispatch and better support for tensor-parallel configurations.

Changes:
- The select_gemm_impl() method already existed and correctly returns CUTLASS experts via select_cutlass_fp8_gemm_impl()
- Simplified apply() to handle only the TENSORRT_LLM path, which doesn't use the modular kernel
- Removed the now-unreachable CUTLASS and fallback paths from apply(), since FusedMoEModularMethod handles them
- Removed the unused import flashinfer_cutlass_moe_fp8

The modular kernel infrastructure now handles the CUTLASS and TP/EP backends by wrapping the quant method with FusedMoEModularMethod, which calls select_gemm_impl() to get the appropriate experts implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
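The wrapping pattern the message describes can be sketched as follows. This is a minimal illustration, not vLLM's real API: only select_gemm_impl() and the FusedMoEModularMethod name come from the commit; CutlassFp8Experts, run(), and the apply() signatures here are hypothetical stand-ins.

```python
class CutlassFp8Experts:
    """Hypothetical stand-in for the CUTLASS experts implementation."""

    def run(self, x):
        # Placeholder compute in place of the real fused-MoE GEMM.
        return [v * 2 for v in x]


class ModelOptFp8MoEMethod:
    """Simplified quant method: apply() keeps only the TRT-LLM path,
    while every other backend is reached via select_gemm_impl()."""

    def select_gemm_impl(self):
        # In the real code this returns CUTLASS experts via
        # select_cutlass_fp8_gemm_impl(); here it is a stand-in.
        return CutlassFp8Experts()

    def apply(self, x):
        # Only the TENSORRT_LLM path would remain here (stubbed out).
        raise NotImplementedError("TRT-LLM path only; use the modular wrapper")


class FusedMoEModularMethod:
    """Wraps a quant method and dispatches to whichever experts
    implementation it selects, so apply() needs no backend branches."""

    def __init__(self, quant_method):
        self.experts = quant_method.select_gemm_impl()

    def apply(self, x):
        return self.experts.run(x)


method = FusedMoEModularMethod(ModelOptFp8MoEMethod())
print(method.apply([1, 2, 3]))  # -> [2, 4, 6]
```

The design point is that backend selection happens once, at wrap time, instead of being re-decided inside apply() on every call.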
Author
Robert Shaw