sequence parallel default dtype (#7364)
The newly released NCCL has finally started using fp32 accumulation for
reduction ops!
* Floating point summation is always done in fp32 accumulators (with the
exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
the accuracy with fp8 and fp16 data types should be much improved.
https://github.com/NVIDIA/nccl/commit/72d2432094d6ae36abd6e511c3a16a2d052dbf94
So the default comms dtype for SP should change from fp32 to the dtype of
the inputs if `nccl>=2.27.3`; the user can still override the default.
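
A minimal sketch of the selection rule described above, not the actual
implementation in this change. The function name, the argument names and
the string dtype labels are hypothetical; the NCCL version is passed in as
a `(major, minor, patch)` tuple, which in practice might come from
something like `torch.cuda.nccl.version()` (an assumption, not confirmed
by this commit).

```python
from typing import Optional, Tuple

# Threshold taken from the commit message: fp32 accumulation is available
# in NCCL from this version onward.
NCCL_FP32_ACCUM_MIN_VERSION = (2, 27, 3)


def pick_sp_comm_dtype(
    input_dtype: str,
    nccl_version: Tuple[int, int, int],
    user_override: Optional[str] = None,
) -> str:
    """Hypothetical helper: choose the dtype used for SP communication."""
    # An explicit user setting always wins over the default.
    if user_override is not None:
        return user_override
    # New enough NCCL accumulates in fp32 internally, so communicating in
    # the input dtype no longer needs an fp32 upcast.
    if nccl_version >= NCCL_FP32_ACCUM_MIN_VERSION:
        return input_dtype
    # Older NCCL: keep the previous fp32 default.
    return "fp32"
```

With these assumptions, bf16 inputs on NCCL 2.27.3 communicate in bf16,
the same inputs on NCCL 2.26.2 fall back to fp32, and passing
`user_override="fp32"` forces fp32 regardless of the NCCL version.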
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>