[Mosaic GPU] Add warpgroup lowering for `RunState` in Pallas.
After this change, we no longer skip tests that require `RunState`. This necessitated a small fix in the Pallas lowering of `while`, as well as enabling the bundling of multiple i32 registers in the `optimization_barrier` lowering.
PiperOrigin-RevId: 745065173