Improve caching allocator for Pascal and newer GPUs. (#17120)
Summary:
```
NVIDIA changed the CUDA allocation behavior on Pascal GPUs. The
page size increased from 1MB to 2MB and allocations larger than 1MB
are now always page-aligned. Previously, allocations larger than 1MB
were aligned to 128KB boundaries.

The new alignment interacted poorly with the caching allocator. The
remaining memory in a page could only be filled by small cudaMalloc
calls, but the caching allocator never calls cudaMalloc for a chunk
smaller than 1MB.
This behavior could also cause a large discrepancy between the memory
usage reported by nvidia-smi and the memory usage reported by
PyTorch, because nvidia-smi counts a partially used page as "full",
while PyTorch only counts the actual memory requested.

This PR makes a few changes to the caching allocator to better support
Pascal and Volta GPUs:

- All cudaMalloc sizes are now multiples of 2MB (the page size)
- Requests between 1MB and 10MB allocate (and split) a 20MB block to
  reduce wasted space due to rounding
- Small requests are now packed into 2MB blocks (instead of 1MB)

These changes reduce Mask R-CNN memory usage by 10-20% in internal
tests on Volta GPUs. Maxwell performance seems to be largely unchanged,
but it's possible that some use cases suffer slightly.
```
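The sizing rules in the summary can be sketched roughly as below. This is an illustrative Python model, not the allocator's actual C++ code: the names (`allocation_size`, `page_footprint`, the constants) are hypothetical, and the exact boundary handling (1MB inclusive as "small", 10MB exclusive for the 20MB block) is an assumption the summary does not spell out.

```python
MB = 1 << 20

PAGE_SIZE = 2 * MB     # page size on Pascal and newer, per the summary
SMALL_LIMIT = 1 * MB   # assumed: requests up to and including 1MB are "small"
SMALL_BLOCK = 2 * MB   # small requests are packed into 2MB blocks
LARGE_LIMIT = 10 * MB  # assumed: requests strictly below 10MB share a 20MB block
LARGE_BLOCK = 20 * MB


def round_up_to_page(size: int) -> int:
    """Round a byte count up to the next multiple of the 2MB page size."""
    return PAGE_SIZE * ((size + PAGE_SIZE - 1) // PAGE_SIZE)


def allocation_size(request: int) -> int:
    """Bytes to ask cudaMalloc for when no cached block fits `request`."""
    if request <= SMALL_LIMIT:
        return SMALL_BLOCK
    if request < LARGE_LIMIT:
        return LARGE_BLOCK
    return round_up_to_page(request)


def page_footprint(request: int) -> int:
    """Bytes a page-aligned allocation occupies on the device.

    This models why nvidia-smi can report more than the bytes requested:
    a partially used page is counted as full.
    """
    return round_up_to_page(request)
```

Under these assumptions, a 3MB request is carved out of a 20MB block, an 11MB request becomes a 12MB cudaMalloc, and a direct 1.5MB cudaMalloc would occupy a full 2MB page, leaving 0.5MB that only a smaller allocation could fill.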
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17120
Differential Revision: D14301536
Pulled By: colesbury
fbshipit-source-id: a8282315ea8f7b8ca149b5066fdeaecd0d404edf