WebGPU QuantizeLinear: add per-axis support and int8 fixes
- Fix clamp range: use type-dependent bounds (-128..127 for int8, 0..255 for uint8) instead of the hardcoded 0..255
- Fix zero-point unpacking: use unpack4xI8 for signed types so packed int8 zero points are sign-extended
- Add per-axis quantization support in WGSL shader and C++ host code
- Register QuantizeLinear kernels for opsets 13-18, 19-20, and 21
- Add int8 tests with exact-division scales to avoid GPU floating-point precision issues
- Exclude 3 existing Int8 tests from the WebGPU EP due to floating-point division precision (same as DML)
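
For reviewers, the intended clamp and per-axis semantics can be sketched as a small CPU reference. This is only an illustration under assumptions, not the WebGPU kernel: `QuantizeOne`, `QuantizePerAxis`, `axis_dim`, and `inner` are hypothetical names, the tensor is assumed to be contiguous, and rounding is assumed to be round-half-to-even as in the ONNX QuantizeLinear definition.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical reference: quantize one value, clamping to the range of T
// (-128..127 for int8_t, 0..255 for uint8_t) rather than a fixed 0..255.
template <typename T>
T QuantizeOne(float x, float scale, T zero_point) {
  constexpr float kMin = static_cast<float>(std::numeric_limits<T>::min());
  constexpr float kMax = static_cast<float>(std::numeric_limits<T>::max());
  // std::nearbyint rounds half to even under the default rounding mode.
  float q = std::nearbyint(x / scale) + static_cast<float>(zero_point);
  q = std::min(std::max(q, kMin), kMax);
  return static_cast<T>(q);
}

// Hypothetical per-axis variant: element i uses the scale and zero point of
// its index along the quantization axis. `axis_dim` is the size of that axis
// and `inner` is the product of the dimensions after it.
template <typename T>
std::vector<T> QuantizePerAxis(const std::vector<float>& x,
                               const std::vector<float>& scale,
                               const std::vector<T>& zero_point,
                               std::size_t axis_dim, std::size_t inner) {
  assert(scale.size() == axis_dim && zero_point.size() == axis_dim);
  std::vector<T> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const std::size_t c = (i / inner) % axis_dim;  // index along the axis
    y[i] = QuantizeOne<T>(x[i], scale[c], zero_point[c]);
  }
  return y;
}
```

Scales such as 0.5 and 2.0 are exactly representable in binary floating point, so `x / scale` is exact for small inputs; that is the idea behind using exact-division scales in the new int8 tests.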