[webgpu] Split large inputs into smaller buffers to bypass maxStorageBufferBindingSize limit (#25962)
### Description
When an input is larger than maxStorageBufferBindingSize, bind it with multiple binding entries. We refine the implementation of
`getByOffset`/`setByOffset` so that, for example, if `input_b` is 257MB while
maxStorageBufferBindingSize is 256MB, callers can still use `b.getByOffset(offset)`
to read the correct element without having to care which binding entry
it falls into. The generated shader code looks like this:
```
var<storage, read> input_b: array<vec4<u32>>; // [0, 256MB) of input_b
var<storage, read> input_b1: array<vec4<u32>>; // [256MB, 257MB) of input_b
```
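For illustration, the dispatch that the refined `getByOffset` could generate might look like the sketch below. The function name and the exact element count are hypothetical; the count assumes `vec4<u32>` elements (16 bytes each), so 256MB holds 16777216 of them.

```
// Illustrative sketch only, not the exact generated code.
fn get_b_by_offset(global_offset: u32) -> vec4<u32> {
  // 256MB / 16 bytes per vec4<u32> = 16777216 elements in the first binding.
  if (global_offset < 16777216u) {
    return input_b[global_offset];
  }
  // Remaining elements live in the second binding entry.
  return input_b1[global_offset - 16777216u];
}
```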
### Motivation and Context
QC's maxStorageBufferBindingSize is 256MB, which is not enough for the phi-4
model. So for QC we customized a new phi-4 model that uses the `slice` op to
split the big matrices, which meant maintaining two different phi-4
models for different platforms.
### For reviewers
The core logic is located in:
- Shader side:
  - `shader_helper.cc`. In the shader, emit as many `@group(0) @binding(...)`
declarations as there are actual buffers.
  - `shader_variable.cc`. Implement the `set_xxx_by_offset(global_offset,
value)` and `get_xxx_by_offset(global_offset)` shader helper functions,
which are used by `setByOffset`/`getByOffset` when the input
exceeds maxStorageBufferBindingSize.
- WebGPU API side:
  - `webgpu_context.cc`. On the WebGPU API side, create as many bind group
entries as there are actual buffers.
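
The API-side splitting described above boils down to simple arithmetic: walk the logical buffer and emit one `(offset, size)` binding entry per chunk, each capped at the limit. A minimal sketch (the helper name is hypothetical, not the actual `webgpu_context.cc` code):

```python
MB = 1024 * 1024

def split_buffer_bindings(buffer_size: int, max_binding_size: int):
    """Split a logical buffer into (offset, size) binding entries,
    each no larger than max_binding_size. Offsets produced this way
    are multiples of max_binding_size, which satisfies WebGPU's
    storage-buffer offset alignment requirement for MB-sized limits."""
    entries = []
    offset = 0
    while offset < buffer_size:
        size = min(max_binding_size, buffer_size - offset)
        entries.append((offset, size))
        offset += size
    return entries

# A 257MB input with a 256MB limit yields two binding entries,
# matching the [0, 256MB) and [256MB, 257MB) ranges in the example above.
print(split_buffer_bindings(257 * MB, 256 * MB))
```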