Modify TileOp GPU implementation to expose more concurrency and better utilize GPU memory bandwidth (#17275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17275
The previous implementation called memcpy inside the kernel, so each thread copied more than a single word. It is more efficient to reduce the data fetched per thread to a single word from memory. This exposes more concurrency and lets adjacent threads take advantage of the GPU's memory coalescing support.
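For illustration only, here is a minimal CPU sketch of the index mapping that a one-word-per-thread tile kernel could use. It is not the code from this change. The function name `tile_per_element`, the `float` element type, and the (outer_dim, tiles, inner_dim) output layout are assumptions. Each loop iteration stands in for one GPU thread, so consecutive "threads" read and write consecutive addresses, which is the access pattern coalescing favors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of a one-element-per-thread tile kernel.
// Assumed layout: input viewed as (outer_dim, inner_dim),
// output viewed as (outer_dim, tiles, inner_dim).
std::vector<float> tile_per_element(const std::vector<float>& input,
                                    std::size_t outer_dim,
                                    std::size_t inner_dim,
                                    std::size_t tiles) {
  assert(input.size() == outer_dim * inner_dim);
  std::vector<float> output(outer_dim * tiles * inner_dim);
  // On a GPU, each value of `index` would be handled by a separate thread.
  for (std::size_t index = 0; index < output.size(); ++index) {
    std::size_t col = index % inner_dim;            // position inside the tiled slice
    std::size_t row = index / (inner_dim * tiles);  // which outer slice is being tiled
    output[index] = input[row * inner_dim + col];   // single-word load and store
  }
  return output;
}
```

Because every output element is computed independently, the amount of available parallelism equals the output size rather than the number of memcpy-sized chunks.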
Reviewed By: takatosp1
Differential Revision: D14120147
fbshipit-source-id: c4734003d4342e55147c5b858f232a006af60b68