[mgpu] Fix for mbarriers with `order_tensor_core=True`.
According to the PTX docs, there is (at least in theory) no guaranteed ordering between `bar.sync` instructions (used by warpgroup barriers) and tcgen05 instructions, so we need to issue the warpgroup barrier after the `tcgen05.fence::before_thread_sync` instruction.
PiperOrigin-RevId: 881340629