[Mosaic GPU] Add support for reductions when writing to GMEM via TMA in WG semantics.
The new logic handles all reductions, but the underlying `async_copy` currently supports only `add`, so I added a test for that case alone.
PiperOrigin-RevId: 839655895