sequence parallel default dtype (#7364)

Commit

323 days ago

The newly released NCCL finally uses fp32 accumulation for reduction ops. Quoting the NCCL commit:

* Floating point summation is always done in fp32 accumulators (with the exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus, the accuracy with fp8 and fp16 data types should be much improved.

https://github.com/NVIDIA/nccl/commit/72d2432094d6ae36abd6e511c3a16a2d052dbf94

So we should change the default comms dtype for SP from fp32 to the same dtype as the inputs if `nccl>=2.27.3`. The user can still override the default.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
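The default selection described in the commit message could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: the helper name `pick_sp_comm_dtype`, the constant name, and the use of plain strings for dtypes are all hypothetical. The only details taken from the message are the `2.27.3` threshold, the fp32 fallback, and the user override.

```python
from typing import Optional, Tuple

# First NCCL version the commit message names as using fp32 accumulators.
# The constant name is hypothetical; only the version comes from the message.
NCCL_FP32_ACCUM_MIN_VERSION: Tuple[int, int, int] = (2, 27, 3)


def pick_sp_comm_dtype(
    input_dtype: str,
    nccl_version: Tuple[int, int, int],
    user_override: Optional[str] = None,
) -> str:
    """Hypothetical helper: choose the comms dtype for sequence parallelism.

    nccl_version is a (major, minor, patch) tuple; with PyTorch it could be
    obtained from torch.cuda.nccl.version(), which is not called here so the
    sketch stays dependency-free.
    """
    # An explicit user setting always wins over the default.
    if user_override is not None:
        return user_override
    # Newer NCCL accumulates in fp32 internally, so communicating in the
    # input dtype should no longer cost reduction accuracy.
    if nccl_version >= NCCL_FP32_ACCUM_MIN_VERSION:
        return input_dtype
    # Older NCCL: keep the previous default of fp32 comms.
    return "fp32"
```

For example, `pick_sp_comm_dtype("bf16", (2, 27, 3))` returns `"bf16"`, while the same call with `(2, 26, 2)` falls back to `"fp32"`. Comparing version tuples rather than version strings avoids lexicographic surprises such as `"2.9"` sorting after `"2.27"`.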

References

#7364 - sequence parallel default dtype

Author

stas00

Parents

22cf1a44

DeepSpeed d3b9cb8c - sequence parallel default dtype (#7364)