feat: Add Triton-optimized Muon optimizer implementation

- Implement triton_muon.py, falling back to PyTorch when Triton is not available
- Select between the Triton and PyTorch implementations based on matrix size
- Maintain backward compatibility with existing code
- Add performance-aware matrix size detection to choose the kernel

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>