Compute the mask in-place, with fewer memory reads, and on CUDA in `XLNetLMHeadModel` (#23332)
When working on TorchInductor, I realised that there was a part of
`XLNetLMHeadModel` that was being compiled to CPU code.
This PR should allow this operation to be fused with other CUDA operations
in `torch.compile`. It should also be faster in eager mode, as this
implementation has a lower memory footprint.
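For reference, here is a minimal sketch of the idea (not the exact diff): the mask is allocated once directly on the target device and filled with an in-place `triu_`, instead of being assembled from `ones` + `tril` + `zeros` + `cat`. The signature and the `same_length` handling follow my reading of `create_mask` in `modeling_xlnet.py` and may differ in detail from the final code.

```python
import torch

def create_mask(qlen: int, mlen: int, same_length: bool = False, device: str = "cpu") -> torch.Tensor:
    # Single allocation on the target device; 1.0 marks positions that may
    # not be attended to, 0.0 marks visible positions.
    mask = torch.ones(qlen, qlen + mlen, device=device)
    if same_length:
        # Take the strictly lower-triangular part of the query block before
        # overwriting the buffer, so every query sees the same window size.
        mask_lo = mask[:, :qlen].tril(-1)
        mask.triu_(mlen + 1)       # in-place: mask out future positions
        mask[:, :qlen] += mask_lo  # mask out positions beyond the window
    else:
        mask.triu_(mlen + 1)       # in-place: mask out future positions
    return mask
```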
If in-place operations are not allowed even in a no-grad context, I still
believe that doing `ones` + `tril` rather than `ones` + `tril` + `zeros` + `cat`
should be faster, simply due to the reduced number of memory reads/writes
(see the sketch below).
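As an illustration of the memory-traffic argument, here is a sketch of the two patterns for the non-`same_length` case (hypothetical helper names, causal part written with `triu` here): the single-buffer version touches the full mask once, whereas the concatenation-based version allocates three intermediates and then reads and writes them again in `cat`.

```python
import torch

def mask_single_buffer(qlen: int, mlen: int, device: str = "cpu") -> torch.Tensor:
    # One (qlen, qlen + mlen) buffer, one triangular fill: out-of-place
    # variant that still avoids the extra zeros + cat pass.
    return torch.ones(qlen, qlen + mlen, device=device).triu(mlen + 1)

def mask_with_cat(qlen: int, mlen: int, device: str = "cpu") -> torch.Tensor:
    # Previous pattern: three intermediate tensors, and cat copies all of
    # them into yet another buffer.
    attn_mask = torch.ones(qlen, qlen, device=device)
    mask_up = torch.triu(attn_mask, diagonal=1)
    attn_mask_pad = torch.zeros(qlen, mlen, device=device)
    return torch.cat([attn_mask_pad, mask_up], dim=1)
```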
I tested that this code produces the same results for `0 <= qlen, mlen <
10` and `same_length in (True, False)`.
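A sketch of that check, assuming `create_mask_old` and `create_mask_new` are hypothetical wrappers around the previous and the new mask code:

```python
import torch

# create_mask_old / create_mask_new are hypothetical wrappers around the
# previous and new implementations of the mask computation.
for qlen in range(10):
    for mlen in range(10):
        for same_length in (True, False):
            old = create_mask_old(qlen, mlen, same_length=same_length)
            new = create_mask_new(qlen, mlen, same_length=same_length)
            torch.testing.assert_close(old, new)
```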