Update MatMulNBits spec and Add Input Checks (#24828)
### Description
Major changes to the spec:
* 2D scale shape: `[N * n_blocks_per_col]` => `[N, n_blocks_per_col]` (see the shape sketch after this list).
* 2D zero-point shape: `[N * CeilDiv(n_blocks_per_col * bits, 8)]` => `[N, CeilDiv(n_blocks_per_col * bits, 8)]`.
* For B, drop the int32 type and only allow uint8.
* Allow bfloat16 as an input/output type.
* Mark the g_idx input as deprecated (it provides no benefit to model size or inference performance).
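A minimal sketch of the shapes implied by the new spec, assuming `K`, `N`, `bits`, and `block_size` are the MatMulNBits attributes; this is just the shape arithmetic, not the actual ORT code:

```python
from math import ceil

def expected_shapes(K: int, N: int, block_size: int, bits: int):
    n_blocks_per_col = ceil(K / block_size)
    blob_size = ceil(block_size * bits / 8)             # bytes per packed block of B
    b_shape = (N, n_blocks_per_col, blob_size)          # uint8 only under the new spec
    scale_shape = (N, n_blocks_per_col)                 # was [N * n_blocks_per_col]
    zero_points_shape = (N, ceil(n_blocks_per_col * bits / 8))  # was 1D
    return b_shape, scale_shape, zero_points_shape
```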
Add a function CheckInputs to verify the input shapes.
The reason for the shape change is to make scale and zero points compatible with other operators like DequantizeLinear and GatherBlockQuantized, which makes graph fusion and model building easier.
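For example, here is a hedged sketch of how the 2D scale layout lines up with DequantizeLinear's blocked-quantization mode (opset 21). The tensor names are illustrative, and `qweight` is assumed to be the weight already unpacked to its `[N, K]` uint8 form (the zero point is omitted for simplicity):

```python
# Illustrative only: with the new layout, a [N, n_blocks_per_col] scale
# initializer feeds DequantizeLinear's blocked mode directly, no Reshape.
from onnx import helper

dq = helper.make_node(
    "DequantizeLinear",
    inputs=["qweight", "scale"],  # qweight: [N, K] uint8, unpacked
    outputs=["dq_weight"],
    axis=1,           # quantization runs along K, in blocks of block_size
    block_size=32,    # must match the MatMulNBits block_size attribute
)
```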
Note that ORT can still handle the legacy 1D format for scale and zero points, and the CUDA/CPU providers can still handle g_idx. However, these are deprecated; our tools should generate 2D scale and zero points and avoid g_idx going forward.
This change is backward compatible. Models built for the old spec can run in the latest ORT (CheckInputs accepts 1D scale and zero points; see the sketch below), and models built for the latest spec can still run in older ORT (since older ORT does not check the dimensions of scale and zero points).
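A minimal Python sketch of the kind of validation CheckInputs adds (the real implementation is C++ inside ORT; the function name and error text here are illustrative). Both the new 2D layout and the legacy 1D layout are accepted:

```python
from math import ceil

def check_scale_and_zero_points(scale_shape, zero_points_shape,
                                N, n_blocks_per_col, bits):
    packed = ceil(n_blocks_per_col * bits / 8)
    ok_scale = scale_shape in [(N, n_blocks_per_col), (N * n_blocks_per_col,)]
    ok_zero = zero_points_shape is None or \
        zero_points_shape in [(N, packed), (N * packed,)]
    if not (ok_scale and ok_zero):
        raise ValueError("MatMulNBits: unexpected scale/zero_points shape")
```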
### Motivation and Context
The CUDA and CPU providers do not check inputs for MatMulNBits, which could cause out-of-bounds memory access.
We are going to share the lm_head weights of MatMulNBits with GatherBlockQuantized. A 2D shape can be used in Gather directly, so we can avoid Reshape nodes.
Our latest models published for foundry use 2D scale and zero points, so I updated the spec to reflect that.