Improve OOM error message (#99699)
This PR adds NVML calls during an OOM error to report the total memory
in use by the current process and by any other CUDA processes on the device.
This makes it easier to identify cases where non-PyTorch libraries have
allocated memory, or where another process (such as a data loader worker)
has also allocated memory on the device.
It also rewords the rest of the error message so that the meaning of each
memory statistic is clearer alongside this new information:
"""
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB.
GPU 0 has a total capacity of 15.90 GiB of which 8.44 MiB is free.
Process 1246069 has 577.00 MiB memory in use. Including non-PyTorch memory,
this process has 15.32 GiB memory in use. Of the allocated memory
14.12 GiB is allocated by PyTorch, and 410.41 MiB is reserved
by PyTorch but unallocated. If reserved but unallocated memory is large
try setting max_split_size_mb to avoid fragmentation. See documentation
for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99699
Approved by: https://github.com/ngimel