[Mosaic GPU] Use TiledLayout for TMEM layouts
TMEM is actually remarkably similar to the register file: it's partitioned into
4 banks that can't communicate (just like no communication is possible between the
register files in each of the 4 quadrants in an SM), and each one of those banks has
32 lanes. Describing TMEM layouts with TiledLayout seems to work quite well and
should make it easy to implement new layouts, such as those for MMA with m=64.
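
As a rough illustration of the analogy (a sketch only, not code from this change): assuming TMEM lanes are numbered consecutively across the 4 banks, a lane index splits into a bank and a lane within that bank the same way a thread index within a warpgroup splits into a warp and a lane. The names below are hypothetical and are not part of the Mosaic GPU API.

```python
# Hypothetical helper, for illustration only; not part of Mosaic GPU.
TMEM_BANKS = 4        # banks that cannot communicate with each other
LANES_PER_BANK = 32   # lanes in each bank


def split_tmem_lane(lane: int) -> tuple[int, int]:
    """Splits a global TMEM lane index into (bank, lane within bank).

    Assumes lanes are numbered consecutively across banks, mirroring how a
    thread index in a warpgroup splits into (warp, lane).
    """
    if not 0 <= lane < TMEM_BANKS * LANES_PER_BANK:
        raise ValueError(f"TMEM lane out of range: {lane}")
    return divmod(lane, LANES_PER_BANK)
```

Under that assumption, split_tmem_lane(33) gives bank 1, lane 1, which matches the (warp, lane) split of thread 33 in a warpgroup.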
PiperOrigin-RevId: 774816973