accelerate
140acb35 - Fix AMD GPU support with should_reduce_batch_size() (#3405)

Commit

1 year ago

Fix AMD GPU support with should_reduce_batch_size() (#3405)

* Fix AMD GPU support with should_reduce_batch_size()

Even though torch has NVIDIA and AMD GPUs operate under the cuda namespace, the out-of-memory error message for AMD GPUs is different. When trying to determine if a model can fit on an AMD GPU, this function will evaluate to false for a `torch.OutOfMemoryError`. This PR adds another check for the error string.

Example error message:

```
'HIP out of memory. Tried to allocate 64.00 GiB. GPU 0 has a total capacity of 63.98 GiB of which 48.63 GiB is free. Of the allocated memory 15.02 GiB is allocated by PyTorch, and 129.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)'
```

* Missing comma

* Update memory.py

Consolidate OOM error check string
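The string check described above can be sketched roughly as follows. This is an illustrative approximation, not the code from `memory.py`: the names `OOM_MARKERS` and `looks_like_oom` are hypothetical, and the assumption is that a single shared substring (`" out of memory."`) is enough to cover both the CUDA and the HIP wording of the error.

```python
# Hypothetical marker list; the actual strings checked in accelerate may differ.
# " out of memory." is a substring of both "CUDA out of memory." and
# "HIP out of memory.", so one entry covers NVIDIA and AMD messages.
OOM_MARKERS = (" out of memory.",)


def looks_like_oom(exception: Exception) -> bool:
    """Return True if the exception message looks like a GPU out-of-memory error.

    Assumes the error is a RuntimeError (or subclass) carrying a single
    string argument, which is how the example message above is shaped.
    """
    if isinstance(exception, RuntimeError) and len(exception.args) == 1:
        message = exception.args[0]
        if isinstance(message, str):
            return any(marker in message for marker in OOM_MARKERS)
    return False


if __name__ == "__main__":
    hip_error = RuntimeError("HIP out of memory. Tried to allocate 64.00 GiB.")
    cuda_error = RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB.")
    other_error = RuntimeError("shape mismatch")
    print(looks_like_oom(hip_error), looks_like_oom(cuda_error), looks_like_oom(other_error))
```

Matching on a shared substring keeps the list short, at the cost of also matching any unrelated message that happens to contain the same phrase.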

References

#3405 - Fix AMD GPU support with should_reduce_batch_size()

Author

cameronshinn

Parents

8576112b
