onnxruntime
2bdb57bb - Update MatMulNBits spec and Add Input Checks (#24828)

Commit
216 days ago
Update MatMulNBits spec and Add Input Checks (#24828)

### Description

Major changes to the spec:
* 2D scale shape: `[N * n_blocks_per_col]` => `[N, n_blocks_per_col]`
* 2D zero-point shape: `[N * CeilDiv(n_blocks_per_col * bits, 8)]` => `[N, CeilDiv(n_blocks_per_col * bits, 8)]`
* For B, drop the int32 type and allow only uint8.
* Allow bfloat16 as input/output type.
* Mark the input g_idx as deprecated, since it provides no benefit to model size or inference performance.

Add a function CheckInputs to verify the input shapes.

The reason for the shape change is to make scale and zero points compatible with other operators such as DequantizeLinear and GatherBlockQuantized, which makes graph fusion and model building easier. Note that ORT can still handle the legacy 1D format for scale and zero points, and CUDA/CPU can still handle g_idx. However, these are deprecated, and our tools shall generate 2D scales and zero points and avoid g_idx going forward.

This change is backward compatible. Models built against the old spec can run in the latest ORT (CheckInputs handles 1D scale and zero points), and models built against the latest spec can still run in older ORT (since older ORT does not check the dimensions of scale and zero points).

### Motivation and Context

The CUDA and CPU providers do not check inputs for MatMulNBits, which could cause out-of-bounds accesses. We are going to share the lm_head weights of MatMulNBits with GatherBlockQuantized; a 2D shape can be used in Gather directly, so we can avoid Reshape nodes. Our latest models published for Foundry use 2D scales and zero points, so the spec is updated to reflect that.
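The shape rules above can be sketched as a small validation helper. This is a hypothetical illustration of the kind of check CheckInputs performs, not ORT's actual API: the function name, signature, and return convention are assumptions, and only the scale/zero-point shapes from the spec text are modeled (both the new 2D layout and the deprecated 1D layout).

```python
import math

def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, i.e. CeilDiv in the spec text."""
    return -(-a // b)

def check_scale_and_zero_shapes(scale_shape, zero_shape,
                                N, K, block_size, bits):
    """Hypothetical sketch of the MatMulNBits shape rules described above.

    Accepts the new 2D layouts:
        scale: [N, n_blocks_per_col]
        zero : [N, CeilDiv(n_blocks_per_col * bits, 8)]
    and, for backward compatibility, the deprecated 1D layouts:
        scale: [N * n_blocks_per_col]
        zero : [N * CeilDiv(n_blocks_per_col * bits, 8)]
    """
    n_blocks_per_col = ceil_div(K, block_size)
    zp_bytes_per_col = ceil_div(n_blocks_per_col * bits, 8)

    ok_scale = scale_shape in ((N, n_blocks_per_col),
                               (N * n_blocks_per_col,))
    ok_zero = zero_shape in ((N, zp_bytes_per_col),
                             (N * zp_bytes_per_col,))
    return ok_scale and ok_zero
```

For example, with N=8, K=64, block_size=32, bits=4, there are 2 blocks per column and 1 packed zero-point byte per column, so the 2D shapes `(8, 2)` and `(8, 1)` pass, as do the legacy 1D shapes `(16,)` and `(8,)`.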