pytorch
079093a6 - Improve caching allocator for Pascal and newer GPUs. (#17120)

Commit View On GitHub

Commit

5 years ago

Improve caching allocator for Pascal and newer GPUs. (#17120) Summary: ``` NVIDIA changed the CUDA allocation behavior on Pascal GPUs. The page size increased from 1MB to 2MB and allocations larger than 1MB are now always page-aligned. Previously, allocations larger than 1MB were aligned to 128KB boundaries. This interacted poorly with the caching allocator. The remaining memory in a page could only be filled by small cudaMalloc calls, but the caching allocator never cudaMalloc's a chunk smaller than 1MB. This behavior could also cause a large discrepancy between the memory usage reported by nvidia-smi and the memory usage reported by PyTorch, because nvidia-smi counts a partially used page as "full", while PyTorch only counts the actual memory requested. This PR makes a few changes to the caching allocator to better support Pascal and Volta GPUs: - All cudaMalloc calls are now multiples of 2MB (the page size) - Requests between 1-10MB allocate (and split) a 20MB block to reduce wasted space due to rounding - Small requests are now packed into 2MB blocks (instead of 1MB) This improves Mask R-CNN memory usage by 10-20% in internal tests on Volta GPUs. Maxwell performance seems to be largely unchanged, but it's possible that some use cases suffer slightly. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/17120 Differential Revision: D14301536 Pulled By: colesbury fbshipit-source-id: a8282315ea8f7b8ca149b5066fdeaecd0d404edf

Author

colesbury

Committer

facebook-github-bot

Parents

8420a202

pytorch 079093a6 - Improve caching allocator for Pascal and newer GPUs. (#17120)

Commit

pytorch
079093a6 - Improve caching allocator for Pascal and newer GPUs. (#17120)