[pallas:mosaic_gpu] `copy_smem_to_gmem` now allows skipping `cp.async.commit_group`

This is needed to fix the SMEM->GMEM waiting behavior in `emit_pipeline`.
Before this change every copy was its own commit group, so the pipeline
had to wait on a pessimistic condition.
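
For illustration only, here is a toy Python model of commit-group bookkeeping. It is not the Pallas or Mosaic GPU API: `BulkCopyQueue` and `run_step` are hypothetical names, and the two-copies-per-step pipeline is an assumption. It sketches one way a "keep one group in flight" wait can become stricter than intended when each copy commits its own group.

```python
class BulkCopyQueue:
    """Toy model of async copy groups (hypothetical, not the Pallas API)."""

    def __init__(self):
        self.uncommitted = []  # copies issued since the last commit
        self.groups = []       # committed groups still in flight, oldest first

    def copy(self, name, commit_group=True):
        self.uncommitted.append(name)
        if commit_group:
            self.commit()

    def commit(self):
        # Everything issued since the last commit becomes one group.
        self.groups.append(self.uncommitted)
        self.uncommitted = []

    def wait(self, allowed_pending):
        # Model a wait that returns once at most `allowed_pending` of the
        # newest groups remain in flight; older groups are treated as done.
        self.groups = self.groups[-allowed_pending:] if allowed_pending else []

    def in_flight(self):
        return [c for group in self.groups for c in group]


def run_step(queue, step, per_copy_groups):
    # Assumed pipeline step: two output copies, then keep one group in flight.
    for out in ("a", "b"):
        queue.copy(f"{out}{step}", commit_group=per_copy_groups)
    if not per_copy_groups:
        queue.commit()  # one commit group for the whole step
    queue.wait(1)
    return queue.in_flight()


print(run_step(BulkCopyQueue(), 0, per_copy_groups=True))   # ['b0']
print(run_step(BulkCopyQueue(), 0, per_copy_groups=False))  # ['a0', 'b0']
```

In this model, one group per copy forces `a0` to finish before the step can proceed, while one group per step leaves both copies of the step in flight.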
PiperOrigin-RevId: 734553668