DeepSpeed
3fbd01cc - FP [6,8,12] quantizer op (#5336)

Commit
1 year ago
FP [6,8,12] quantizer op (#5336)

Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support.

Requires Ampere+ architecture, because this op initially focuses only on `bfloat16` input types.

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>