Fail loudly when MatMulNBits receives unsupported block_size on CPU EP (#28590)
### Description
`MatMulNBits` with `block_size > 256` (e.g. 512) silently produces
all-zero output on CPU EP. The MLAS dequantization fallback path has a
`default: break;` that skips dequantization entirely, leaving a
zero-initialized buffer that flows into GEMM.
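A minimal sketch of the failure mode described above, not the actual MLAS code: `DequantizeSketch` and `DequantizeSilently` are hypothetical names, and the placeholder kernel only writes a nonzero value so that a skipped call is easy to spot.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel specialized on block size.
// It writes a nonzero placeholder so a skipped call is easy to spot.
template <int BlockSize>
void DequantizeSketch(std::vector<float>& dst) {
  for (float& v : dst) {
    v = 1.0f;
  }
}

// Mirrors the shape of the failure: an unhandled block size falls through
// `default: break;` and the zero-initialized buffer is returned untouched.
std::vector<float> DequantizeSilently(int block_size, std::size_t count) {
  std::vector<float> dst(count, 0.0f);
  switch (block_size) {
    case 16: DequantizeSketch<16>(dst); break;
    case 32: DequantizeSketch<32>(dst); break;
    case 64: DequantizeSketch<64>(dst); break;
    case 128: DequantizeSketch<128>(dst); break;
    case 256: DequantizeSketch<256>(dst); break;
    default: break;  // nothing is written and nothing is reported
  }
  return dst;
}
```

With `block_size = 512` the caller gets back a buffer of zeros and no indication that dequantization never ran.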
Changes:
- Add `ORT_ENFORCE` in the `MatMulNBits` constructor rejecting block sizes
other than {16, 32, 64, 128, 256}, so the error surfaces at session
initialization
- Replace silent `default: break;` with `ORT_ENFORCE` in
`MlasDequantizeBlockwise`, `MlasQuantizeBlockwise`, and
`MlasBlockwiseQuantizedBufferSizes` as defense-in-depth
- Add regression test `MatMulNBits.UnsupportedBlockSize_512`
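The construction-time check can be sketched roughly as follows. This is an illustration under assumptions, not the code in this change: `IsSupportedBlockSize` and `MatMulNBitsSketch` are hypothetical names, and a standard exception stands in for `ORT_ENFORCE` so the snippet compiles without ONNX Runtime headers.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: true only for the block sizes listed in this change.
inline bool IsSupportedBlockSize(std::int64_t block_size) {
  switch (block_size) {
    case 16:
    case 32:
    case 64:
    case 128:
    case 256:
      return true;
    default:
      return false;
  }
}

// Hypothetical stand-in for the kernel constructor. The real change uses
// ORT_ENFORCE; a plain exception keeps this sketch self-contained.
class MatMulNBitsSketch {
 public:
  explicit MatMulNBitsSketch(std::int64_t block_size) : block_size_(block_size) {
    if (!IsSupportedBlockSize(block_size_)) {
      throw std::invalid_argument(
          "MatMulNBits: unsupported block_size " + std::to_string(block_size_) +
          "; expected one of 16, 32, 64, 128, 256");
    }
  }

  std::int64_t block_size() const { return block_size_; }

 private:
  std::int64_t block_size_;
};
```

Rejecting the value in the constructor means a model with `block_size = 512` fails when the kernel is created rather than producing zeros at inference time.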
### Motivation and Context
Fix for https://github.com/microsoft/onnxruntime/issues/28551.
Users passing `block_size=512` (valid per the op spec, which only
requires power-of-2 ≥ 16) get silently wrong results with no error or
warning. This affects real models like Tencent's Hy-MT1.5-1.8B-2bit GGUF
which uses per-512-element scales.
The fix converts silent miscomputation into an immediate, actionable
error message.
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>