Fix ROCm BF16 conversion intrinsics in inference v2 (#7843) (#7846)
Fixes #7843
On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in
the code are not provided, e.g.:
- `__ll2bfloat16_rn`
- `__int2bfloat16_rn`
- `__short2bfloat16_rn`
- `__bfloat162uint_rn`
This causes compilation errors on HIP platforms.
This PR introduces fallback paths using functions that are available on the
HIP platform, mirroring the [conversion utils in
csrc](https://github.com/deepspeedai/DeepSpeed/blob/2c362837b0ef906ea7e7506bab3a625faa945cdd/csrc/includes/conversion_utils.h#L351).
The conversion paths are:
- int/uint -> bf16: convert to float (or double for 64-bit), then to
bf16.
- bf16 -> int/uint: convert bf16 to float, then to the integer type.
- float -> bf16: build from bf16 via supported HIP helpers.
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>