Allocate one warp per input index in compute_cuda_kernel (#43354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43354
Instead of assigning one thread per input index to write out that index's repeats, we assign a whole warp to each index. This avoids the costly uncoalesced memory accesses and branch divergence that occur when each thread repeats its own index independently.
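As a rough illustration of the warp-per-index scheme (a hypothetical sketch, not the actual PyTorch kernel; the names `repeats`, `offsets`, and `repeat_warp_per_index` and the exclusive-prefix-sum layout are assumptions for this example):

```cuda
// Sketch: each warp expands one input index into the output.
// repeats[i] = how many times index i appears in the output;
// offsets[i] = exclusive prefix sum of repeats (start of i's run).
__global__ void repeat_warp_per_index(
    const int64_t* repeats,
    const int64_t* offsets,
    int64_t* out,
    int64_t n_indices) {
  const int kWarpSize = 32;
  int64_t warp_id =
      ((int64_t)blockIdx.x * blockDim.x + threadIdx.x) / kWarpSize;
  int lane = threadIdx.x % kWarpSize;
  if (warp_id >= n_indices) return;

  int64_t start = offsets[warp_id];
  int64_t rep = repeats[warp_id];
  // All 32 lanes of the warp share one loop bound, so there is no
  // intra-warp branch divergence; consecutive lanes write consecutive
  // output slots, so the stores coalesce.
  for (int64_t j = lane; j < rep; j += kWarpSize) {
    out[start + j] = warp_id;
  }
}
```

Under the old thread-per-index scheme, adjacent threads wrote to runs starting at different offsets and looped different numbers of times, producing scattered stores and divergent loops within a warp; here both problems disappear by construction.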
Test Plan: Run trainer to test
Reviewed By: ngimel
Differential Revision: D23230917
fbshipit-source-id: 731e912c844f1d859b0384fcaebafe69cb4ab56a