jax
27a568a3 - [Pallas MGPU] Clip the size of contiguous TMA transfers to the size of the ref. This avoids OOB accesses and is more in-line with the tensor map TMAs.

Commit

19 days ago

[Pallas MGPU] Clip the size of contiguous TMA transfers to the size of the ref. This avoids OOB accesses and is more in-line with the tensor map TMAs.

There's an additional constraint that the transfer size must be a multiple of 16 bytes and not overflow the dst/src memory spaces. We handle this by rounding down to the largest multiple of 16 bytes.

Read more here: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-bulk-copy

PiperOrigin-RevId: 855286258
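The clip-then-round-down arithmetic described in the commit message can be sketched in plain Python. This is only an illustration of the sizing rule, not the code from this commit: the names `TMA_ALIGNMENT_BYTES` and `clip_transfer_bytes` are hypothetical, and the sketch assumes both sizes are already expressed in bytes.

```python
# Hypothetical sketch of the sizing rule; not the actual Pallas MGPU code.
TMA_ALIGNMENT_BYTES = 16  # per the commit message, sizes must be a multiple of 16 bytes


def clip_transfer_bytes(requested_bytes: int, ref_bytes: int) -> int:
    """Return a transfer size that stays inside the ref and is 16-byte aligned.

    requested_bytes: size the caller asked to copy (assumed to be in bytes).
    ref_bytes: total size of the ref being read or written (assumed in bytes).
    """
    if requested_bytes < 0 or ref_bytes < 0:
        raise ValueError("sizes must be non-negative")
    # Step 1: never copy more than the ref holds, which avoids OOB accesses.
    clipped = min(requested_bytes, ref_bytes)
    # Step 2: round down to the largest multiple of 16 bytes that still fits.
    return clipped - (clipped % TMA_ALIGNMENT_BYTES)


if __name__ == "__main__":
    print(clip_transfer_bytes(4096, 1000))  # clipped to 1000, then rounded down
    print(clip_transfer_bytes(256, 1024))   # already in bounds and aligned
```

Rounding down rather than up keeps the transfer inside the ref, at the cost of leaving up to 15 trailing bytes uncopied; in this sketch a ref smaller than 16 bytes yields a zero-byte transfer.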

References

#34199 - [Pallas MGPU] Clip the size of contiguous TMA transfers to the size of the ref. This avoids OOB accesses and is more in-line with the tensor map TMAs.

Author

Rifur13

Committer

Google-ML-Automation

Parents

7cee4db6
