Vectorize non-persistent Softmax kernels (#36485)
Summary:
Add read/write vectorization to the non-persistent softmax kernels only. For now the launch logic is changed only minimally, and `ILP=vectorization=2` is always used (the code can handle other values, but `ILP=2` has been the most consistent performer).
Dispatch to persistent / non-persistent kernels is unchanged.
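
For illustration only, here is a minimal host-side C++ sketch of the general idea behind read/write vectorization with `ILP=2`. It is not the kernel code from this PR: `FloatPack`, `misaligned_head`, and `exp_shifted_vectorized` are hypothetical names, and the sketch assumes `float` data. It processes a scalar head until the pointers reach a pack boundary, moves `ILP` elements per access in the body, and finishes with a scalar tail. A CUDA kernel would typically load through an aligned pack pointer instead of `memcpy`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr int ILP = 2;  // elements moved per vectorized access (value used in this commit)

// Hypothetical pack type: ILP floats treated as one aligned unit.
struct alignas(sizeof(float) * ILP) FloatPack {
    float val[ILP];
};

// Number of leading elements before `p` reaches a pack-aligned address.
inline std::size_t misaligned_head(const float* p) {
    std::size_t offset =
        (reinterpret_cast<std::uintptr_t>(p) % sizeof(FloatPack)) / sizeof(float);
    return offset == 0 ? 0 : ILP - offset;
}

// Writes out[i] = exp(in[i] - max_val) for i in [0, n).
void exp_shifted_vectorized(const float* in, float* out, std::size_t n, float max_val) {
    assert(n == 0 || (in != nullptr && out != nullptr));

    // The pack path is only used when input and output share the same
    // misalignment; otherwise everything goes through the scalar loop.
    std::size_t head = misaligned_head(in);
    if (misaligned_head(out) != head || head > n) {
        head = n;
    }

    std::size_t i = 0;
    // Scalar head: advance to the first pack-aligned element.
    for (; i < head; ++i) {
        out[i] = std::exp(in[i] - max_val);
    }
    // Vectorized body: one pack-wide read and one pack-wide write per iteration.
    for (; i + ILP <= n; i += ILP) {
        FloatPack pack;
        std::memcpy(&pack, in + i, sizeof(pack));
        for (int j = 0; j < ILP; ++j) {
            pack.val[j] = std::exp(pack.val[j] - max_val);
        }
        std::memcpy(out + i, &pack, sizeof(pack));
    }
    // Scalar tail: leftover elements that do not fill a whole pack.
    for (; i < n; ++i) {
        out[i] = std::exp(in[i] - max_val);
    }
}
```

The head and tail loops are what let a sketch like this accept arbitrary pointers and lengths while keeping the body on whole packs.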
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36485
Differential Revision: D21477775
Pulled By: ngimel
fbshipit-source-id: 9ff7fd243695d7bbf4121390085b64db0bbdef35