Ampere has CUDA_MAX_THREADS_PER_SM == 2048 (#41138)
Summary:
The NVIDIA Ampere architecture whitepaper lists 2048 as the maximum number of threads per SM (page 44, table 5):
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
![image](https://user-images.githubusercontent.com/1032377/86958633-56051580-c111-11ea-94da-c726a61dc00a.png)
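
For illustration only, here is a minimal sketch of how a per-SM thread limit might be selected at compile time. The helper name `max_threads_per_sm` and its `cuda_arch` parameter are hypothetical and are not the actual PyTorch macro definition. The only values taken as given are 1024 for Turing (compute capability 7.5) and 2048 for Ampere as cited above. Falling through to 2048 for every other value is an assumption of this sketch and should be checked against NVIDIA's compute capability tables.

```cpp
#include <cstdint>

// Hypothetical helper (not PyTorch's actual code): maps a __CUDA_ARCH__-style
// value, e.g. 750 for compute capability 7.5, to a per-SM thread limit.
constexpr std::uint32_t max_threads_per_sm(int cuda_arch) {
  // Turing (7.5) allows 1024 resident threads per SM.
  // Assumption: every other architecture handled here, including
  // Ampere A100 (8.0), is treated as allowing 2048.
  return cuda_arch == 750 ? 1024u : 2048u;
}

// Compile-time checks for the two cases this sketch is meant to cover.
static_assert(max_threads_per_sm(750) == 1024, "Turing limit");
static_assert(max_threads_per_sm(800) == 2048, "Ampere A100 limit");
```

The point of the sketch is that an exact match on 7.5 leaves newer architectures at 2048, whereas a `>= 750` style comparison would also cap Ampere at 1024.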
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41138
Differential Revision: D22488904
Pulled By: malfet
fbshipit-source-id: 97bd585d91e1a368f51aa6bd52081bc57d42dbf8