DeepSpeed
a4562ab7 - FP6 quantization end-to-end.

Committed 2 years ago
FP6 quantization end-to-end.

* Initialize the fp6-quant-kernel integration.
* Add necessary parameters of kernel interfaces and the linear-layer selection logic.
* Upload kernel code.
* The simple script for debugging.
* Fix typo.
* Update.
* Fix split-k.
* Fix some errors and add a test case.
* Workspace for Inference Kernels (#1)
* Add transform_param functions and update format.
* Kernel debug.
* Fix include.
* Update core_ops.cpp.
* Add split-k support.
* Fix.
* Fix kernel error.
* Update.
* Update.
* Fix rebase errors.
* Add missing include.
* Fix the bug where the attribute uses the weight information for memory allocation.
* Avoid GPU preallocation during weight loading.
* Add support for larger shapes in the gated activation kernel.
* Update.
* Model update.
* Fix all weight preprocessing.
* Add split-k heuristic.
* Avoid reading the scale attribute on non-quantized tensors.
* Change the scales from attributes to new tensors. Provide the end-to-end script given a HuggingFace model id.
* Hard-code commented-out scales in the kernel to work around the bug.
* Support the user config for quantization. Fix a kernel bug.
* Per-operator test functions.
* Multiply scales by 1e12 according to the kernel design.
* Revert "Workspace for Inference Kernels (#1)". This reverts commit 1528732bd2ca54bae248846c6dac34729ac97cdf.
* Remove the format-only changes.
* Put the quantization into the transform_param function.

---------
Co-authored-by: Shiyang Chen <csycfl@gmail.com>
Co-authored-by: Haojun Xia <xhjustc@mail.ustc.edu.cn>
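The scale-handling items above (scaling each weight channel into the FP6-representable range and storing the scales as separate tensors rather than as attributes on the weight) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not DeepSpeed's kernel code: `FP6_MAX`, `quantize_per_channel`, and `dequantize` are hypothetical names, the value 28.0 assumes an e3m2-style FP6 grid, and integer rounding stands in for a real FP6 cast.

```python
# Hypothetical sketch of per-channel FP6-style quantization.
# FP6_MAX assumes an e3m2-like grid whose largest magnitude is 28.0.
FP6_MAX = 28.0

def quantize_per_channel(weight):
    """Quantize each row (output channel) onto a bounded grid.

    Returns the quantized values and the scales as two separate
    containers, mirroring 'scales as new tensors, not attributes'.
    """
    q_rows, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row) or 1.0
        scale = amax / FP6_MAX          # map the channel into [-FP6_MAX, FP6_MAX]
        q_rows.append([round(v / scale) for v in row])  # stand-in for an FP6 cast
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Rescale each channel by its own scale."""
    return [[v * s for v in row] for row, s in zip(q_rows, scales)]
```

Keeping the scales in their own tensor (rather than as an attribute hanging off the weight) makes them visible to memory allocation and checkpoint loading, which is the failure mode several of the bullets above address.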