Introduce CUDA-only `_scaled_mm` op (#107341)
Summary:
Based on D48377631, with updates to guard the use of cuBLAS features only available in CUDA 11.8 and later.
According to https://docs.nvidia.com/cuda/cublas/#id99, only FP8 matrix types can be scaled. A `Float8_e4m3`x`Float8_e4m3` product can be returned as `Float8_e4m3`, or upcast to `Half`, `BFloat16`, or `Float`; in the upcast case, however, `result_scale` has no effect and `amax` is not computed.
An optional `bias` argument can also be passed to the function; it must be a vector of either `Half` or `BFloat16` values, which is added to each row of the result matrix.
See the table below for supported input and output types:
| Mat1 type | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3 | Float8_e4m3 | Float16 | Float8_e4m3, Float16 |
| Float8_e4m3 | Float8_e4m3 | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2 | Float8_e4m3 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e5m2 | Float8_e4m3 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3 | Float8_e5m2 | Float16 | Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3 | Float8_e5m2 | BFloat16 | Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e5m2 | Float8_e5m2 | Not supported | Not supported |
Decomposition is not implemented until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
from typing import Optional, Tuple

import torch
from torch import Tensor
from torch._decomp import register_decomposition

aten = torch.ops.aten


@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    # Compute in float32, apply the optional scales, then cast to the output dtype.
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    # Placeholder amax (not computed in this sketch)
    return rc, torch.tensor(0.0, device=mat1.device)
```
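For illustration only, the body of the sketch above can be exercised on CPU with `Half` inputs standing in for FP8 (CPU `torch.mm` does not support FP8; `scaled_mm_ref` is a hypothetical name, and the scales are plain float tensors):

```python
import torch

def scaled_mm_ref(mat1, mat2, *, dtype=None, scale_a=None,
                  scale_b=None, scale_result=None):
    # Same logic as the decomposition sketch, without the op registration.
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)

a = torch.ones(16, 32, dtype=torch.half)
b = torch.ones(32, 16, dtype=torch.half)
out, amax = scaled_mm_ref(a, b, dtype=torch.float, scale_a=torch.tensor(0.5))
# each entry is 0.5 * 32 = 16.0
```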
Known limitations:
- Only works for matrix sizes divisible by 16
- The 1st operand must be in row-major order and the 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)
Test Plan: Tests in test_matmul_cuda.py
Differential Revision: D48415871
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo