DeepSpeed
a8b82153 - Optimize the fp-dequantizer to get high memory-BW utilization (#5373)

Commit
1 year ago
Optimize the fp-dequantizer to get high memory-BW utilization (#5373)

This PR removes the for loop inside the dequantizer kernel and uses as many threads and blocks as needed to dequantize the quantized matrix. The previous implementation processed one group per thread block, which reduces efficiency for smaller group sizes and makes each thread process more data than necessary; using more parallelism improves dequantization performance. Based on my testing, for a 4K-by-4K matrix, dequantizing from fp8 to bf16 gives a 2.5x speedup (improving BW efficiency from 1 TB/s to 2.5 TB/s on an Nvidia H100 GPU).

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
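
A rough sketch of the parallelization strategy the commit describes, not the actual kernel in csrc/fp_quantizer/quantize.cu: instead of assigning one quantization group per thread block and looping over its elements, launch enough blocks that each thread dequantizes a single element. The kernel and launcher names, the e4m3 input format, and the one-float-scale-per-group layout are illustrative assumptions.

// Illustrative sketch only -- not the DeepSpeed kernel. Assumes CUDA 11.8+
// (cuda_fp8.h), e4m3-encoded inputs, and one float scale per quantization
// group; these details are assumptions, not taken from the PR.
#include <cuda_fp8.h>
#include <cuda_bf16.h>

__global__ void dequant_fp8_to_bf16_sketch(__nv_bfloat16* out,
                                           const __nv_fp8_e4m3* in,
                                           const float* group_scales,
                                           int num_elems,
                                           int group_size)
{
    // One element per thread: no per-thread loop over the group, so the
    // degree of parallelism no longer depends on the group size.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_elems) return;

    float scale = group_scales[idx / group_size];  // per-group scale
    out[idx] = __float2bfloat16(float(in[idx]) * scale);
}

// Host-side launch: cover the whole matrix with blocks so the kernel is
// limited only by memory bandwidth.
void launch_dequant_sketch(__nv_bfloat16* out, const __nv_fp8_e4m3* in,
                           const float* group_scales,
                           int num_elems, int group_size)
{
    const int threads = 256;
    const int blocks  = (num_elems + threads - 1) / threads;
    dequant_fp8_to_bf16_sketch<<<blocks, threads>>>(out, in, group_scales,
                                                    num_elems, group_size);
}

In this layout the block count scales with the matrix size rather than with the number of groups, which is the effect the commit credits for the bandwidth improvement on H100.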
Files changed
  • csrc/fp_quantizer/quantize.cu