[MLIR][GPU] Ensure all lanes in cluster have final reduction value (#165764)
This is a fix for a cluster size of 32 when the subgroup size is 64.
Previously, only lanes [16, 32) u [48, 64) contained the correct
clusterwise reduction value. This PR adds a swizzle instruction to
broadcast the correct value down to lanes [0, 16) u [32, 48).