[MoE Refactor][3/N] Use Modular Kernels for ModelOpt FP8
Following the pattern from #30825, refactor `ModelOptFp8MoEMethod` to use
the modular kernels (MK) interface for tensor-parallel cases.
Changes:
- Initialize `mk.FusedMoEModularKernel` in `process_weights_after_loading()` (see the sketch after this list)
- FlashInfer CUTLASS path: `FlashInferAllGatherMoEPrepareAndFinalize` + `FlashInferExperts`
- Triton/DeepGEMM path: `MoEPrepareAndFinalizeNoEP` + `TritonOrDeepGemmExperts`
- Simplify `apply()` to a single unified `self.kernel()` call
- Remove the direct `flashinfer_cutlass_moe_fp8` import and its call sites
- Keep the TensorRT-LLM path unchanged (TODO: convert it to MK later)
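
Below is a minimal sketch of the new wiring, assuming vLLM's modular-kernel
(`mk`) interfaces. Import paths, the FlashInfer enable flag, and the elided
constructor/call arguments are illustrative placeholders, not the exact code
from this change:

```python
# Sketch only: module paths, flag names, and argument lists are approximations.
import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
    MoEPrepareAndFinalizeNoEP)
from vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe import (
    TritonOrDeepGemmExperts)
from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (
    FlashInferExperts)
from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize import (
    FlashInferAllGatherMoEPrepareAndFinalize)


class ModelOptFp8MoEMethod:  # abbreviated; only the refactored parts shown
    def process_weights_after_loading(self, layer):
        # ... existing FP8 weight processing stays as-is ...
        if self.flashinfer_cutlass_enabled:  # hypothetical flag name
            # FlashInfer CUTLASS path: all-gather prepare/finalize paired
            # with the FlashInfer expert implementation.
            prepare_finalize = FlashInferAllGatherMoEPrepareAndFinalize(...)  # args elided
            experts = FlashInferExperts(...)  # args elided
        else:
            # Pure tensor parallelism (no EP): trivial prepare/finalize
            # paired with Triton-or-DeepGEMM expert compute.
            prepare_finalize = MoEPrepareAndFinalizeNoEP()
            experts = TritonOrDeepGemmExperts(...)  # args elided
        self.kernel = mk.FusedMoEModularKernel(prepare_finalize, experts)

    def apply(self, layer, x, topk_weights, topk_ids):
        # TensorRT-LLM path unchanged (handled before this point); every
        # other path funnels through the single modular-kernel call.
        return self.kernel(
            hidden_states=x,
            w1=layer.w13_weight,
            w2=layer.w2_weight,
            topk_weights=topk_weights,
            topk_ids=topk_ids,
        )
```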
Signed-off-by: Robert Shaw <robshaw@redhat.com>