[Mosaic GPU] Add support for warp shuffles with elements wider than 32 bits
We simply break them up into smaller shuffles and concatenate the results.
No test changes were necessary: this path is already covered by the hypothesis
test, which previously ignored the cases that raised NotImplementedError.
Those cases no longer raise, so they are now exercised.
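
For illustration only, the split-and-concatenate approach can be sketched as a
pure-Python simulation of a warp. This is not the Mosaic GPU implementation:
the names shuffle_32 and shuffle_wide are hypothetical, and the 32-bit chunk
size is an assumption based on the native shuffle width.

```python
WARP_SIZE = 32
MASK32 = 0xFFFFFFFF


def shuffle_32(values, src_lane):
    """Simulated 32-bit warp shuffle (hypothetical helper).

    Lane i receives the value held by lane src_lane(i).
    """
    assert len(values) == WARP_SIZE
    assert all(0 <= v <= MASK32 for v in values)
    return [values[src_lane(i) % WARP_SIZE] for i in range(WARP_SIZE)]


def shuffle_wide(values, src_lane, bitwidth):
    """Shuffle elements wider than 32 bits by splitting them into chunks."""
    assert bitwidth % 32 == 0
    result = [0] * WARP_SIZE
    for chunk_idx in range(bitwidth // 32):
        shift = 32 * chunk_idx
        # Extract one 32-bit chunk from every lane's element.
        chunk = [(v >> shift) & MASK32 for v in values]
        # Shuffle that chunk with the narrow primitive.
        shuffled = shuffle_32(chunk, src_lane)
        # Put the shuffled chunk back at its original bit position.
        result = [r | (s << shift) for r, s in zip(result, shuffled)]
    return result


# 64-bit elements, each lane swapping with its neighbour (lane XOR 1).
lanes = [(i << 40) | i for i in range(WARP_SIZE)]
swapped = shuffle_wide(lanes, lambda i: i ^ 1, bitwidth=64)
```

Every chunk uses the same lane mapping, so the reassembled result matches what
a single wide shuffle would have produced.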
PiperOrigin-RevId: 786184810