Optimize the fp-dequantizer to get high memory-BW utilization (#5373)
This PR removes the for loop inside the dequantizer kernel and uses as
many threads and blocks as needed to dequantize the quantized matrix.
The previous implementation processed one group per thread block, which
reduces efficiency for smaller group sizes and also makes each thread
process more data than necessary; we can instead use more parallelism
to improve dequantization performance.
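For illustration, here is a minimal sketch of the flat one-thread-per-element launch scheme described above. The element layout, per-group scale storage, and byte decoding below are assumptions for the example, not the PR's actual kernel:

```cuda
#include <cuda_bf16.h>
#include <cstdint>

// Hypothetical layout: `q` holds quantized bytes, `scales` holds one
// fp32 scale per group of `group_size` consecutive elements.
__global__ void dequant_flat(__nv_bfloat16* out,
                             const uint8_t* q,
                             const float* scales,
                             int num_elements,
                             int group_size) {
    // One thread per output element: no per-thread loop, so parallelism
    // does not shrink with the group size.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_elements) {
        // Illustrative decode only: treat the byte as a signed value and
        // apply the group's scale (real fp8 decoding is more involved).
        float v = static_cast<float>(static_cast<int8_t>(q[idx]));
        out[idx] = __float2bfloat16(v * scales[idx / group_size]);
    }
}

// Launch with as many blocks as needed to cover the matrix, e.g.:
//   int threads = 256;
//   int blocks  = (num_elements + threads - 1) / threads;
//   dequant_flat<<<blocks, threads>>>(out, q, scales,
//                                     num_elements, group_size);
```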
Based on my testing results, for a 4K by 4K matrix, dequantizing from
fp8 to bf16 gives a 2.5x speedup (improving the BW efficiency from 1 TB/s
to 2.5 TB/s on an Nvidia H100 GPU).
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>