FP [6,8,12] quantizer op (#5336)

2 years ago

Flexible-bit quantizer-dequantizer library with fp6/fp8/fp12 support.

Requires an Ampere or newer GPU architecture, because the op initially targets only `bfloat16` input types.

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
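For readers unfamiliar with low-bit floating-point formats, here is a minimal pure-Python sketch of what a quantize-dequantize round trip does to a single value. It is not the op added by this commit: the function name `quantize_dequantize`, the toy format (no inf/NaN encodings, saturating overflow) and the exponent/mantissa splits used in the usage lines are all illustrative assumptions.

```python
import math


def quantize_dequantize(x: float, exp_bits: int, man_bits: int) -> float:
    """Round x to the nearest value of a toy sign/exponent/mantissa format.

    Hypothetical helper for illustration only. Assumes an IEEE-style bias,
    subnormals, no inf/NaN encodings, and saturation on overflow.
    """
    if x == 0.0 or math.isnan(x):
        return x

    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias                    # exponent of the smallest normal value
    max_exp = (2 ** exp_bits - 1) - bias  # every exponent code holds finite values
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp

    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= max_val:
        return sign * max_val             # saturate instead of overflowing

    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade is e - 1.
    _, e = math.frexp(mag)
    e = max(e - 1, min_exp)               # subnormals reuse the smallest spacing

    step = 2.0 ** (e - man_bits)          # gap between neighbours in this binade
    # Python's round() rounds halves to even, i.e. round-to-nearest-even.
    return sign * round(mag / step) * step


# Usage: the same input at three assumed bit layouts (sign bit not counted).
print(quantize_dequantize(1.1, exp_bits=3, man_bits=2))  # 6-bit layout  -> 1.0
print(quantize_dequantize(1.1, exp_bits=4, man_bits=3))  # 8-bit layout  -> 1.125
print(quantize_dequantize(1.1, exp_bits=4, man_bits=7))  # 12-bit layout -> 1.1015625
```

Fewer mantissa bits mean a coarser grid, which is why the 6-bit result lands further from the input than the 12-bit one. A production kernel would typically also apply a scale factor per group of values before rounding; that step is omitted here.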