[Pallas/Mosaic GPU] Implement a more comprehensive matmul kernel to see what we're still missing
I annotated a number of issues in the test. To make the test run, I also needed to add
support for accumulator reference allocation and discharge to the main lowering code.
Ideally, we'd defer all of that to run_scoped, but run_scoped can't allocate barriers...
PiperOrigin-RevId: 679143948