maximum number of resident threads per SM for sm_86 is 1536 (#45889)
Summary:
The value comes from the technical specifications table in the CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
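
For reference, 1536 is the limit on resident threads per multiprocessor (SM) on compute capability 8.6; the maximum block size remains 1024. A minimal sketch of how such a limit might be selected at compile time is below. The helper name `max_threads_per_sm` and its integer encoding of the architecture (mirroring `__CUDA_ARCH__`, e.g. 860 for sm_86) are hypothetical and are not taken from the PyTorch sources; the 2048 fallback is an assumption that holds only for the architectures not special-cased here.

```cpp
#include <cstdint>

// Hypothetical helper (not PyTorch's actual code): maps a compute capability
// encoded like __CUDA_ARCH__ (e.g. 860 for sm_86) to the maximum number of
// resident threads per SM.
constexpr std::uint32_t max_threads_per_sm(int cuda_arch) {
  return cuda_arch == 750 ? 1024u   // sm_75
       : cuda_arch == 860 ? 1536u   // sm_86, the value this change concerns
                          : 2048u;  // assumed fallback for other architectures
}

// Compile-time check of the value this commit is about.
static_assert(max_threads_per_sm(860) == 1536,
              "sm_86 allows 1536 resident threads per SM");
```

Launch-bound calculations that assume 2048 threads per SM would request more resident blocks than an sm_86 device can hold, which is why the per-architecture value matters.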
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45889
Reviewed By: albanD
Differential Revision: D24131188
Pulled By: ngimel
fbshipit-source-id: 31d3038f7b1bc403751448c62b19609573c67a49