FP6 quantization end-to-end.
* Initialize the fp6-quant-kernel integration.
* Add the necessary kernel-interface parameters and the linear-layer selection logic.
* Upload kernel code.
* Add a simple script for debugging.
* Fix typo.
* Update.
* Fix split-k.
* Fix some errors and add test case.
* Workspace for Inference Kernels (#1)
* Add transform_param functions and update format.
* Debug kernel.
* Fix include.
* Update core_ops.cpp
* Add split k support
* Fix.
* Fix kernel error.
* Update.
* Update.
* Fix rebase errors.
* Add missed include.
* Fix a bug where the attribute used the weight's information for memory allocation.
* Avoid GPU preallocation during weight loading.
* Add support of larger shapes for gated activation kernel.
* Update.
* Update model.
* Fix all weight preprocessing.
* Add split-k heuristic.
* Avoid reading scale attribute on non-quantized tensors.
* Change the scales from attributes to separate tensors. Provide an end-to-end script given a Hugging Face model id.
* Hard-code (comment out) the scales in the kernel to work around the bug.
* Support the user config for quantization. Fix kernel bug.
* Add per-operator test functions.
* Multiply scales by 1e12 according to the kernel design.
* Revert "Workspace for Inference Kernels (#1)". This reverts commit 1528732bd2ca54bae248846c6dac34729ac97cdf.
* Remove the format-only changes.
* Put the quantization into the transform_param function.
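For readers unfamiliar with the scale handling touched by the commits above (scales stored as separate tensors, quantization moved into `transform_param`), the overall shape of the transform can be sketched as below. This is a minimal illustration only: it uses a signed 6-bit *integer* range as a stand-in for the actual FP6 floating-point format, and the function names and per-row scale layout are assumptions, not the repository's real API.

```python
QMAX = 31  # largest magnitude of a signed 6-bit integer (stand-in for FP6)

def transform_param(weight):
    """Quantize a 2-D weight (list of rows) to 6-bit ints plus per-row scales.

    The scales are returned as a separate tensor-like list rather than being
    attached to the weight as an attribute.
    """
    qweight, scales = [], []
    for row in weight:
        # Symmetric per-row scale; fall back to 1.0 for an all-zero row.
        scale = max(abs(v) for v in row) / QMAX or 1.0
        scales.append(scale)
        qweight.append([max(-QMAX - 1, min(QMAX, round(v / scale))) for v in row])
    return qweight, scales

def dequantize(qweight, scales):
    # Reconstruct an approximation of the original weight from ints + scales.
    return [[q * s for q in row] for row, s in zip(qweight, scales)]
```

With a symmetric scale, each reconstructed value lands within half a quantization step of the original, which is what the per-operator tests mentioned above would check.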
---------
Co-authored-by: Shiyang Chen <csycfl@gmail.com>
Co-authored-by: Haojun Xia <xhjustc@mail.ustc.edu.cn>