jax
27a568a3 - [Pallas MGPU] Clip the size of contiguous TMA transfers to the size of the ref. This avoids OOB accesses and is more in-line with the tensor map TMAs.

Commit
19 days ago
[Pallas MGPU] Clip the size of contiguous TMA transfers to the size of the ref. This avoids OOB accesses and is more in line with the tensor map TMAs.

There's an additional constraint: the transfer size must be a multiple of 16 bytes and must not overflow the dst/src memory spaces. We handle this by rounding the clipped size down to the largest multiple of 16 bytes.

Read more here: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-bulk-copy

PiperOrigin-RevId: 855286258
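The size computation described in the message can be sketched in plain Python. This is only an illustration of the arithmetic, not the code from the commit: the helper name `clip_transfer_bytes`, its parameters, and the constant `TMA_ALIGNMENT_BYTES` are hypothetical, and the order of operations (clip first, then round down) is an assumption read off the message.

```python
# Hypothetical constant: the message says transfer sizes must be a
# multiple of 16 bytes.
TMA_ALIGNMENT_BYTES = 16


def clip_transfer_bytes(requested_bytes: int, ref_bytes: int) -> int:
    """Hypothetical helper, not part of the Pallas API.

    Clips a contiguous transfer to the size of the ref, then rounds the
    result down to the largest multiple of 16 bytes so that the copy
    neither reads nor writes past the end of the ref.
    """
    if requested_bytes < 0 or ref_bytes < 0:
        raise ValueError("sizes must be non-negative")
    # Step 1: never transfer more bytes than the ref holds.
    clipped = min(requested_bytes, ref_bytes)
    # Step 2: round down (not up) so the aligned size stays within bounds.
    return (clipped // TMA_ALIGNMENT_BYTES) * TMA_ALIGNMENT_BYTES


print(clip_transfer_bytes(256, 200))  # 192: clipped to 200, rounded down
print(clip_transfer_bytes(64, 1024))  # 64: already in bounds and aligned
print(clip_transfer_bytes(100, 8))    # 0: ref is smaller than one 16-byte unit
```

Rounding down rather than up is what keeps the transfer in bounds: rounding 200 bytes up to 208 would reintroduce the out-of-bounds access that the clipping is meant to avoid.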