[Mosaic GPU] Add support for 1CTA MMA with M=64
The TMEM layout for M=64 is unusual (each MMA instruction only accesses half the lanes),
which is why the patch is not entirely trivial.
There is also an alternative layout we could use: instead of packing the column tiles, we
would split a 128xN TMEM reference into two 64xN refs, each with half the lanes. It might even
be slightly better for performance, since it would let us avoid the awkward N split that we
have to do right now.
PiperOrigin-RevId: 775150187