CUDA Vectorized Dropout (#33879)
Summary:
Add vectorization to the dropout kernels for both reads and writes. Move the `masked_scale_kernel` implementation to `TensorIterator` to pick up the recent autovectorization additions by zasdfgbnm, and add a vectorized specialization of the dropout training kernel, along with some fairly conservative dispatch logic.
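For readers unfamiliar with this kind of dispatch, here is a minimal host-side C++ sketch of the idea, not the code in this PR. It assumes a conservative rule: use a vector width only when both pointers are aligned to that width and the element count is a multiple of it, and otherwise fall back to scalar access. The names `pick_vec_size`, `masked_scale_chunks`, and `masked_scale` are hypothetical, and the plain loops stand in for what would be vector loads and stores in a real CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: largest width (4, 2, or 1 elements) such that `ptr`
// is aligned to the full vector and `numel` splits into whole vectors.
template <typename T>
int pick_vec_size(const T* ptr, std::size_t numel) {
    const auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    for (int vec = 4; vec > 1; vec /= 2) {
        const bool aligned = addr % (vec * sizeof(T)) == 0;
        const bool divides = numel % vec == 0;
        if (aligned && divides) return vec;
    }
    return 1;  // scalar fallback
}

// Processes VEC elements per step; on a GPU each step would be one
// vector load of the input and one vector store of the output.
template <typename T, int VEC>
void masked_scale_chunks(const T* in, const std::uint8_t* mask, T* out,
                         std::size_t numel, T scale) {
    assert(numel % VEC == 0);
    for (std::size_t i = 0; i < numel; i += VEC) {
        for (int j = 0; j < VEC; ++j) {
            out[i + j] = in[i + j] * static_cast<T>(mask[i + j]) * scale;
        }
    }
}

// Conservative dispatch: take the smaller of the widths allowed by the
// input and output pointers. Returns the width that was used.
template <typename T>
int masked_scale(const T* in, const std::uint8_t* mask, T* out,
                 std::size_t numel, T scale) {
    int vec = pick_vec_size(in, numel);
    const int vec_out = pick_vec_size(out, numel);
    if (vec_out < vec) vec = vec_out;
    switch (vec) {
        case 4: masked_scale_chunks<T, 4>(in, mask, out, numel, scale); break;
        case 2: masked_scale_chunks<T, 2>(in, mask, out, numel, scale); break;
        default: masked_scale_chunks<T, 1>(in, mask, out, numel, scale); break;
    }
    return vec;
}
```

Falling back to the scalar path whenever any condition fails keeps the vectorized path free of tail handling and unaligned accesses. The cost is that some inputs that could be partially vectorized are not.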
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33879
Differential Revision: D20222853
Pulled By: ngimel
fbshipit-source-id: 711f56ca907fbc792a10d4bf069c28adab7d6ad7