ggml-webgpu: Update register tiling matmul to use f32 accumulation (#21644)
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function