WebGPU QuantizeLinear: per-axis, blocked quantization, Option B packing
Extend the WebGPU QuantizeLinear kernel to support per-axis and blocked quantization modes for int8/uint8 output types.
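For reference, the core math is identical across all three modes; only the scale/zero-point lookup differs. A minimal per-tensor sketch (helper name is illustrative, not the kernel's actual identifier):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative per-tensor quantization: y = clamp(round(x / scale) + zero_point).
// ONNX specifies round-half-to-even; std::nearbyint follows the default
// round-to-nearest-even floating-point environment.
inline int8_t QuantizeToInt8(float x, float scale, int8_t zero_point) {
  float v = std::nearbyint(x / scale) + static_cast<float>(zero_point);
  // Signed clamp range is -128..127 (uint8 output would clamp to 0..255).
  return static_cast<int8_t>(std::clamp(v, -128.0f, 127.0f));
}
```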
Shader changes:
- Add get_scale()/get_zero_point() with per-tensor/per-axis/blocked branches
- Add get_blocked_scale_idx() for stride-based blocked index computation
- Convert to Option B packing: each thread quantizes 4 elements and packs via pack4xI8 (no shared memory or workgroupBarrier)
- Hoist the scale/zero_point fetch out of the per-element path in per-tensor mode
- Fix clamp range for signed types (-128..127 vs 0..255)
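The stride-based lookup behind get_blocked_scale_idx() can be sketched in scalar C++ (parameter names and the exact decomposition are assumptions for illustration; the WGSL helper operates on the same quantities via the new uniforms):

```cpp
#include <cstdint>

// Illustrative mapping from a flat data index to its blocked-scale index.
// axis_stride: element stride of the quantization axis
// axis_dim:    dimension size along the axis
// block_size:  elements per quantization block along the axis
uint32_t BlockedScaleIndex(uint32_t i, uint32_t axis_stride, uint32_t axis_dim,
                           uint32_t block_size) {
  // The scale tensor's size on the axis is ceil(axis_dim / block_size).
  uint32_t scale_dim = (axis_dim + block_size - 1) / block_size;
  uint32_t outer = i / (axis_stride * axis_dim);   // coordinates before the axis
  uint32_t axis_coord = (i / axis_stride) % axis_dim;
  uint32_t inner = i % axis_stride;                // coordinates after the axis
  return outer * (scale_dim * axis_stride) +
         (axis_coord / block_size) * axis_stride + inner;
}
```

For example, with input shape [4, 6], axis = 1, block_size = 3, the scales tensor has shape [4, 2] and element (r, c) reads scale r*2 + c/3.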
C++ changes:
- Remove ORT_NOT_IMPLEMENTED for blocked quantization
- Add blocked uniforms: block_size, norm_dim_on_axis, scale_dim_times_axis_stride
- Register kernels for opsets 13-18, 19-20, 21+
Tests: add int8 per-tensor, per-axis, and blocked quantization tests with exact expected values
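A worked per-axis case for intuition (these example values are illustrative, not the exact vectors used in the added unit tests): with axis = 0, each row of a 2x2 input selects its own scale/zero_point.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

// Illustrative per-axis (axis = 0) quantization of a 2x2 row-major input:
// row r uses scales[r] and zero_points[r].
std::array<int8_t, 4> QuantizePerAxis2x2(const std::array<float, 4>& x,
                                         const std::array<float, 2>& scales,
                                         const std::array<int8_t, 2>& zero_points) {
  std::array<int8_t, 4> y{};
  for (size_t i = 0; i < 4; ++i) {
    size_t row = i / 2;  // axis-0 coordinate selects the scale/zero_point pair
    float v = std::nearbyint(x[i] / scales[row]) +
              static_cast<float>(zero_points[row]);
    y[i] = static_cast<int8_t>(std::clamp(v, -128.0f, 127.0f));
  }
  return y;
}
```

For input {-1, 2, 3, -4} with scales {0.5, 1.0} and zero_points {0, 10}, row 0 gives {-2, 4} and row 1 gives {13, 6}.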