Add per-device allocator object in CUDACachingAllocator (#37567)
Summary:
Reduces lock contention and BlockPool management costs by tracking applicable state in per-device structures.
`THCCachingAllocator` now maintains a set of `DeviceCachingAllocator` objects (one per device), each of which holds its own allocator state and implements the operations on that state.
Only global state remains in the top-level `THCCachingAllocator` object -- namely, `allocated_blocks`, the mapping between the raw storage pointers and the allocator's underlying Block structure. Global operations deal mostly with this translation and then pass the bulk of the work on to the device-specific allocator.
Conversely, device-specific state and operations consist mostly of managing the device's underlying blocks.
This has the following benefits:
- Performance: Access to the global pointer map is serialized independently of the per-device state -- reducing lock contention between operations on different devices.
- Simplicity: Managing the block pools in separate device-specific objects is conceptually more intuitive, simplifies the code, and makes certain operations more efficient -- even in the absence of contention (e.g. `free_cached_blocks`, `synchronize_and_free_events`, `emptyCache`, `get_all_blocks`).
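
The two-level locking described above can be illustrated with a minimal, self-contained sketch. It is not the code in this change: the class names `DeviceAllocatorSketch` and `CachingAllocatorSketch`, the methods `allocate`/`release`, and the simplified `Block` are hypothetical, host memory stands in for device memory, and streams, events, and block splitting are left out. It only shows a global mutex guarding the pointer map while each device object guards its own cached blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <new>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for the allocator's Block structure.
struct Block {
  int device;
  std::size_t size;
  void* ptr;
};

// Per-device state and operations, guarded by a per-device mutex.
class DeviceAllocatorSketch {
 public:
  explicit DeviceAllocatorSketch(int device) : device_(device) {}
  ~DeviceAllocatorSketch() {
    for (auto& b : owned_) ::operator delete(b->ptr);
  }

  Block* allocate(std::size_t size) {
    std::lock_guard<std::mutex> lock(mutex_);
    // Reuse the first cached block that is large enough.
    for (auto it = cached_.begin(); it != cached_.end(); ++it) {
      if ((*it)->size >= size) {
        Block* block = *it;
        cached_.erase(it);
        return block;
      }
    }
    // Otherwise create a new block (assumption: host memory replaces cudaMalloc).
    owned_.emplace_back(new Block{device_, size, ::operator new(size)});
    return owned_.back().get();
  }

  void release(Block* block) {
    std::lock_guard<std::mutex> lock(mutex_);
    cached_.push_back(block);  // keep the block for later reuse
  }

  std::size_t cached_count() {
    std::lock_guard<std::mutex> lock(mutex_);
    return cached_.size();
  }

 private:
  int device_;
  std::mutex mutex_;
  std::vector<std::unique_ptr<Block>> owned_;  // every block ever created
  std::vector<Block*> cached_;                 // blocks currently free
};

// Top-level object: holds only the global pointer-to-Block map.
class CachingAllocatorSketch {
 public:
  explicit CachingAllocatorSketch(int device_count) {
    for (int i = 0; i < device_count; ++i)
      devices_.emplace_back(new DeviceAllocatorSketch(i));
  }

  void* allocate(int device, std::size_t size) {
    // Device work happens under the device lock only.
    Block* block = devices_.at(static_cast<std::size_t>(device))->allocate(size);
    std::lock_guard<std::mutex> lock(mutex_);  // global lock: map update only
    allocated_blocks_[block->ptr] = block;
    return block->ptr;
  }

  void release(void* ptr) {
    Block* block = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);  // global lock: translation only
      auto it = allocated_blocks_.find(ptr);
      if (it == allocated_blocks_.end()) return;
      block = it->second;
      allocated_blocks_.erase(it);
    }
    // The global lock is already dropped before the device takes over.
    devices_[static_cast<std::size_t>(block->device)]->release(block);
  }

  std::size_t allocated_count() {
    std::lock_guard<std::mutex> lock(mutex_);
    return allocated_blocks_.size();
  }

  std::size_t cached_count(int device) {
    return devices_.at(static_cast<std::size_t>(device))->cached_count();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<void*, Block*> allocated_blocks_;
  std::vector<std::unique_ptr<DeviceAllocatorSketch>> devices_;
};
```

In this sketch the global mutex is held only for the map lookup or update, so threads working on different devices contend only on that short critical section and never on each other's block lists.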
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37567
Differential Revision: D21458556
Pulled By: colesbury
fbshipit-source-id: ef56cb373797b180df72f0998ebc35972c892288