[Pallas MGPU] Clip the size of contiguous TMA transfers to the size of the ref. This avoids out-of-bounds (OOB) accesses and is more in line with the tensor map TMAs.
The transfer size must also be a multiple of 16 bytes and must not overflow the dst/src memory spaces. We handle this by rounding the clipped size down to the largest multiple of 16 bytes that does not exceed it.
Read more here: https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-bulk-copy
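A minimal sketch of the clip-then-round-down arithmetic described above. The helper name `clip_transfer_bytes` and its parameters are hypothetical, chosen for illustration only; this is not the actual Pallas MGPU lowering code.

```python
# Bulk-copy transfer sizes must be a multiple of 16 bytes (per the commit text).
ALIGNMENT_BYTES = 16


def clip_transfer_bytes(requested_bytes: int, ref_bytes: int) -> int:
    """Hypothetical helper: returns a transfer size that fits in the ref.

    The requested size is first clipped to the size of the ref, then rounded
    down to a multiple of 16 bytes so the result never exceeds the ref.
    """
    if requested_bytes < 0 or ref_bytes < 0:
        raise ValueError("sizes must be non-negative")
    # Clip: never transfer more bytes than the ref holds.
    clipped = min(requested_bytes, ref_bytes)
    # Round down (not up), so the aligned size stays within bounds.
    return clipped - clipped % ALIGNMENT_BYTES
```

For example, under these assumptions a 4096-byte request against a 1000-byte ref would become a 992-byte transfer. Rounding down rather than up means a few trailing bytes may be left uncopied, but the transfer never reads or writes past the end of the ref.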
PiperOrigin-RevId: 855286258