[clang][NVPTX] Add support for mixed-precision FP arithmetic (#168359)
This change adds support for mixed-precision floating-point
arithmetic for `f16` and `bf16`. Patterns of the form:
```
%fh = fpext half %h to float
%resfh = fp-operation(%fh, ...)
...
%fb = fpext bfloat %b to float
%resfb = fp-operation(%fb, ...)
```
where `fp-operation` is any of:
- `fadd`
- `fsub`
- `llvm.fma.f32`
- `llvm.nvvm.add(/fma).*`

are lowered to the corresponding mixed-precision instructions, which
combine the conversion and the operation into a single instruction,
from `sm_100` onwards.
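
For example, a minimal IR sketch of the `bf16` fma case (the function
and value names here are illustrative, not taken from the added tests):
```
define float @fma_bf16_mixed(bfloat %a, bfloat %b, float %c) {
  %fa = fpext bfloat %a to float
  %fb = fpext bfloat %b to float
  ; on sm_100+ the fpext/fma sequence can be selected as a single
  ; mixed-precision fma instruction instead of separate converts + fma
  %r = call float @llvm.fma.f32(float %fa, float %fb, float %c)
  ret float %r
}
declare float @llvm.fma.f32(float, float, float)
```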
This change also adds the following intrinsics, completing support
for all variants of the floating-point `add`/`fma` operations needed
for the corresponding mixed-precision instructions:
- `llvm.nvvm.add.(rn/rz/rm/rp){.ftz}.sat.f`
- `llvm.nvvm.fma.(rn/rz/rm/rp){.ftz}.sat.f`
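
As a sketch of how one of the new intrinsics would be used from IR
(assuming the same two-`float` operand signature as the existing
non-`.sat` `llvm.nvvm.add.*` variants):
```
declare float @llvm.nvvm.add.rn.sat.f(float, float)

define float @add_rn_sat(float %a, float %b) {
  ; round-to-nearest-even add; .sat clamps the f32 result to [0.0, 1.0]
  %r = call float @llvm.nvvm.add.rn.sat.f(float %a, float %b)
  ret float %r
}
```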
We lower `fneg` followed by one of the above addition
intrinsics to the corresponding `sub` instruction.
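
A minimal sketch of that fold, using one of the newly added intrinsics
(names are illustrative, signature assumed as above):
```
declare float @llvm.nvvm.add.rn.sat.f(float, float)

define float @sub_rn_sat(float %a, float %b) {
  %nb = fneg float %b
  ; the fneg + add intrinsic pair is folded into the corresponding
  ; sub instruction (here, a saturating round-to-nearest sub)
  %r = call float @llvm.nvvm.add.rn.sat.f(float %a, float %nb)
  ret float %r
}
```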
Tests are added in `fp-arith-sat.ll`, `fp-fold-sub.ll`, and
`builtins-nvptx.c` for the newly added intrinsics and builtins, and in
`mixed-precision-fp.ll` for the mixed-precision instructions.
PTX spec reference for mixed precision instructions:
https://docs.nvidia.com/cuda/parallel-thread-execution/#mixed-precision-floating-point-instructions