Set CUDA arch correctly when building with torch.utils.cpp_extension (#23408)
Summary:
The old behavior was to always use `sm_30`. The new behavior is:
- When building via `setup.py`, check whether `'arch'` appears in `extra_compile_args`. If so, don't change anything.
- If `TORCH_CUDA_ARCH_LIST` is set, respect that (it can list one or more arches).
- Otherwise, query device capability and use that.
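The selection order above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `select_arch_flags` is a hypothetical helper, the device capability is passed in as a `(major, minor)` tuple so the sketch runs without a GPU, and only whitespace-separated arch lists are handled.

```python
import os


def select_arch_flags(cflags, env=None, device_capability=None):
    """Return the -gencode flags to append to `cflags` (hypothetical helper)."""
    env = os.environ if env is None else env

    # 1. The caller already chose an architecture: leave the flags alone.
    if any('arch' in flag for flag in cflags):
        return []

    # 2. TORCH_CUDA_ARCH_LIST, e.g. "3.5 5.2 6.0 6.1 7.0+PTX".
    arch_list = env.get('TORCH_CUDA_ARCH_LIST', '').split()

    # 3. Unset or empty: fall back to the capability of the visible device.
    if not arch_list:
        if device_capability is None:
            raise ValueError('no arch list and no device capability given')
        arch_list = ['{}.{}'.format(*device_capability)]

    flags = []
    for arch in arch_list:
        wants_ptx = arch.endswith('+PTX')
        num = arch[:-len('+PTX')] if wants_ptx else arch
        num = num.replace('.', '')
        # Machine code (cubin) for this architecture.
        flags.append('-gencode=arch=compute_{0},code=sm_{0}'.format(num))
        if wants_ptx:
            # Additionally embed PTX so newer GPUs can JIT-compile it.
            flags.append('-gencode=arch=compute_{0},code=compute_{0}'.format(num))
    return flags
```

On a real machine the capability tuple would typically come from `torch.cuda.get_device_capability()`. For example, `select_arch_flags([], env={}, device_capability=(6, 1))` yields a single `sm_61` flag, while any flag containing `arch` in `cflags` yields an empty list.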
To test this, for example on a machine with `torch` installed for Python 3.7:
```
$ git clone https://github.com/pytorch/extension-cpp.git
$ cd extension-cpp/cuda
$ python setup.py install
$ cuobjdump --list-elf build/lib.linux-x86_64-3.7/lltm_cuda.cpython-37m-x86_64-linux-gnu.so
ELF file 1: lltm.1.sm_61.cubin
```
Existing tests in `test_cpp_extension.py` for `load_inline` and for compiling via `setup.py` in `test/cpp_extensions/` cover this.
Closes gh-18657
EDIT: some more tests:
```
from torch.utils.cpp_extension import load
lltm = load(name='lltm', sources=['lltm_cuda.cpp', 'lltm_cuda_kernel.cu'])
```
```
# with TORCH_CUDA_ARCH_LIST undefined or an empty string
$ cuobjdump --list-elf /tmp/torch_extensions/lltm/lltm.so
ELF file 1: lltm.1.sm_61.cubin
# with TORCH_CUDA_ARCH_LIST = "3.5 5.2 6.0 6.1 7.0+PTX"
$ cuobjdump --list-elf build/lib.linux-x86_64-3.7/lltm_cuda.cpython-37m-x86_64-linux-gnu.so
ELF file 1: lltm_cuda.cpython-37m-x86_64-linux-gnu.1.sm_35.cubin
ELF file 2: lltm_cuda.cpython-37m-x86_64-linux-gnu.2.sm_52.cubin
ELF file 3: lltm_cuda.cpython-37m-x86_64-linux-gnu.3.sm_60.cubin
ELF file 4: lltm_cuda.cpython-37m-x86_64-linux-gnu.4.sm_61.cubin
ELF file 5: lltm_cuda.cpython-37m-x86_64-linux-gnu.5.sm_70.cubin
```
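For a scripted version of the manual check above, the `cuobjdump --list-elf` output can be parsed for the embedded architectures. This is a small sketch and `list_cubin_arches` is a hypothetical helper name. It assumes the `ELF file N: <name>.N.sm_XY.cubin` line format shown in the listings above.

```python
import re


def list_cubin_arches(cuobjdump_output):
    """Extract the sm_XY architecture names from `cuobjdump --list-elf` text."""
    # Each embedded cubin is reported with a name ending in ".sm_XY.cubin".
    return re.findall(r'\.(sm_\d+)\.cubin', cuobjdump_output)


sample = (
    "ELF file 1: lltm_cuda.cpython-37m-x86_64-linux-gnu.1.sm_35.cubin\n"
    "ELF file 2: lltm_cuda.cpython-37m-x86_64-linux-gnu.2.sm_52.cubin\n"
)
print(list_cubin_arches(sample))
```

The input text could be captured with `subprocess.run(['cuobjdump', '--list-elf', path], capture_output=True, text=True).stdout`. Note that `--list-elf` reports only cubin entries, so the PTX requested by `7.0+PTX` does not show up in this listing.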
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23408
Differential Revision: D16784110
Pulled By: soumith
fbshipit-source-id: 69ba09e235e4f906b959fd20322c69303240ee7e