Make persistent softmax WARP_SIZE aware. (#25937)
Summary:
Also update the documentation to reflect the warp-size facts on both CUDA and ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25937
Differential Revision: D17291453
Pulled By: bddppq
fbshipit-source-id: ee1d7a34f3ad6c05a8f1564d4f9e516e497f2199