Increase DenseNet121 Batch Size for Better Utilization (#496)
Summary:
DenseNet121 in the original research paper (https://arxiv.org/pdf/1608.06993.pdf) uses
the ImageNet dataset with input shape (3, 224, 224) and a batch size of 256. To bring
this benchmark's profile closer to community usage, increase the batch size to 256.
Here are experimental inference numbers on an A100 (40 GB GPU memory); a sketch of the timing setup follows the table:
Batch Size | GPU Time (ms) | CPU Dispatch Time (s) | CPU Total Time (s) | Time Increase vs. Previous BS | Notes
-- | -- | -- | -- | -- | --
16 | 46.71795 | 0.04666 | 0.04672 | 0.00% | CPU overhead hides GPU work; GPU largely idle
32 | 49.71725 | 0.04963 | 0.04973 | 6.42% |
64 | 54.9376 | 0.05488 | 0.05496 | 10.50% |
128 | 74.36391 | 0.04987 | 0.07437 | 35.36% |
256 | 144.27956 | 0.05129 | 0.14429 | 94.02% | Best Batch Size
512 | | | | | RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
1024 | | | | | RuntimeError: CUDA out of memory. Tried to allocate 2.30 GiB (GPU 0; 39.59 GiB total capacity; 35.85 GiB already allocated; 1.83 GiB free; 35.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
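
A minimal sketch of how these numbers could be reproduced, assuming torchvision's `densenet121` in eval mode on a single A100 and CUDA-event timing (this is an illustrative setup, not the exact benchmark harness used in this repo):

```python
import torch
import torchvision

def time_inference(model, batch_size, warmup=3, iters=10):
    # Random input matching the ImageNet shape used in the paper: (N, 3, 224, 224).
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    # Average GPU time per iteration, in milliseconds.
    return start.elapsed_time(end) / iters

model = torchvision.models.densenet121().cuda().eval()
for bs in (16, 32, 64, 128, 256):
    print(f"batch_size={bs}: {time_inference(model, bs):.2f} ms")
```

Larger batch sizes (512, 1024) fail on a 40 GB A100 as shown in the table, so 256 is the largest batch size that both fits in memory and keeps the GPU busy enough to hide CPU dispatch overhead.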
Pull Request resolved: https://github.com/pytorch/benchmark/pull/496
Reviewed By: xuzhao9
Differential Revision: D31697847
Pulled By: aaronenyeshi
fbshipit-source-id: d0fbe98c66524a6a1de5b07a404c372aeae518bf