[NVPTX] Support for dense and sparse MMA intrinsics with block scaling. (#163561)
This change adds dense and sparse MMA intrinsics with block scaling. The
implementation is based on [PTX ISA version
9.0](https://docs.nvidia.com/cuda/parallel-thread-execution/). Tests for
the new intrinsics target PTX 8.7 and SM 120a and are generated by
`llvm/test/CodeGen/NVPTX/wmma-ptx87-sm120a.py`. The tests have been
verified with ptxas from the CUDA 13.0 release.
Dense MMA intrinsics with block scaling were contributed by
@schwarzschild-radius.