[pallas:mgpu] Use two barriers for try-cancel barriers in `dynamic_scheduling_loop`.
The recent race condition fix prevented any thread from running ahead, because every thread waited at the `cancel_user_barrier` until all threads had completed the previous iteration. This change uses two barriers to avoid that stall.
PiperOrigin-RevId: 877964436