[webgpu] Always use tile matmulnbits for block_size = 32 (#23140)
### Description
After the optimization of prefill time with #23102, it seems that always
using the tile matmulnibits with block_size = 32 can bring better
performance even for discrete gpu for phi3 model.
Phi3 becomes 42.64 tokens/sec from 32.82 tokens/sec in easy mode on my
NV RTX 2000 GPU.