[Pallas:MGPU] Expose multicast TMA stores

The stores are expressed with a new MulticastRef transform. Unfortunately, we can't use the
transform to implement regular multicast stores: we always represent stores with swap_p,
which also returns the previous value, and a multicast store has no good load-side dual
(ld_reduce is not what we want!).

PiperOrigin-RevId: 819742981