[Pallas/MGPU] Implement block spec evaluation correctly
The previous implementation made some surprising assumptions about the contents
of the block specs and wasn't correct in general. The new implementation handles
all the cases and seems sufficient to finally run the matmul example with
multiple k steps and produce correct results (it's also shorter!).
PiperOrigin-RevId: 679175212