[Pallas:MGPU] Use a much better matmul kernel in the collective matmul
It turns out it wasn't the collective part that was holding us back, but the
matmul part. Now that we have a really good matmul kernel, we can simply plug it
into the collective loop and add a small amount of code that issues the sends. In
my simple benchmarking setup it already seems to consistently beat the
NCCL+cuBLAS baseline within a single host.
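For context, a minimal sketch of the loop structure this refers to, written at
the JAX level with ppermute rather than the actual Pallas:MGPU kernel. The
function name and the ring schedule here are illustrative assumptions; in the
fused kernel the send overlaps with the next step's matmul instead of running
after it.

  import jax
  import jax.numpy as jnp

  def ring_ag_matmul(lhs_chunk, rhs, axis_name="x"):
    """All-gather matmul: a per-chunk matmul plus a ring send per step.

    lhs_chunk: this device's [m_chunk, k] shard of the LHS.
    rhs:       replicated [k, n] RHS.
    Returns the full [m_chunk * n_devices, n] product on every device.
    """
    n_dev = jax.lax.psum(1, axis_name)  # static axis size
    my_idx = jax.lax.axis_index(axis_name)
    m_chunk = lhs_chunk.shape[0]
    out = jnp.zeros((m_chunk * n_dev, rhs.shape[1]),
                    dtype=jnp.result_type(lhs_chunk, rhs))

    def body(i, carry):
      chunk, out = carry
      # The chunk held at step i originated on device (my_idx + i) % n_dev,
      # so its result goes into that device's row block of the output.
      src = (my_idx + i) % n_dev
      out = jax.lax.dynamic_update_slice(out, chunk @ rhs,
                                         (src * m_chunk, 0))
      # Pass the chunk along the ring: device j sends to device j - 1.
      chunk = jax.lax.ppermute(
          chunk, axis_name,
          perm=[(j, (j - 1) % n_dev) for j in range(n_dev)])
      return chunk, out

    _, out = jax.lax.fori_loop(0, n_dev, body, (lhs_chunk, out))
    return out

  # Hypothetical usage: run under shard_map with the LHS sharded on "x"
  # and the output replicated, e.g.
  #   jax.shard_map(ring_ag_matmul, mesh=mesh,
  #                 in_specs=(P("x", None), P(None, None)),
  #                 out_specs=P())(lhs, rhs)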
PiperOrigin-RevId: 810794450