[Pallas:MGPU] Expose the optimized SMEM/GMEM copy layout
We can implement synchronous SMEM/GMEM copies using regular loads/stores with
`plgpu.layout_cast` to the right layouts. Alternatively, this could be a
dedicated primitive that calls `mgpu.copy_tiled`, but the load/store approach is
more future-proof: once we transition fully to the layout inference pass, the
layout cast should become unnecessary, and the plain load/store is the program
we want to emit anyway.
PiperOrigin-RevId: 794567794