[Mosaic GPU] Add a warp-specialized kernel with a separate TMA warp
This kernel significantly improves performance for large head_dim,
reaching ~62% utilization at a sequence length of 4k and ~71% at 32k.
TODO: The two kernels are quite similar, so it should be possible to
collapse them into one.
PiperOrigin-RevId: 647597865