Reduce CUDACachingAllocator lock contention (#118550)
Summary: NativeCachingAllocator has a global lock that becomes a point of contention when one process uses multiple GPUs. The lock is required to look up a Block from a pointer. Making this lock more fine-grained reduces the contention.
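Illustration: one common way to make a pointer-to-Block lookup lock more fine-grained is to shard the map by a hash of the pointer and give each shard its own mutex. The sketch below is a minimal, self-contained example of that general technique and is not taken from this change. `ShardedBlockMap`, the simplified `Block` struct, `kNumShards`, and the use of `std::unordered_map` and `std::hash` are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the allocator's per-allocation metadata.
struct Block {
  void* ptr;
  std::size_t size;
};

// Illustrative sharded pointer -> Block map. Each shard owns its own mutex,
// so lookups that hash to different shards do not serialize on one lock.
class ShardedBlockMap {
 public:
  // Registers a block under its pointer, locking only the owning shard.
  void insert(Block* block) {
    assert(block != nullptr);
    Shard& shard = shard_for(block->ptr);
    std::lock_guard<std::mutex> guard(shard.mutex);
    shard.blocks[block->ptr] = block;
  }

  // Returns the Block registered for ptr, or nullptr if there is none.
  // When remove is true, the entry is erased under the same shard lock.
  Block* find(void* ptr, bool remove = false) {
    Shard& shard = shard_for(ptr);
    std::lock_guard<std::mutex> guard(shard.mutex);
    auto it = shard.blocks.find(ptr);
    if (it == shard.blocks.end()) {
      return nullptr;
    }
    Block* block = it->second;
    if (remove) {
      shard.blocks.erase(it);
    }
    return block;
  }

 private:
  // Assumed shard count, chosen only for this example.
  static constexpr std::size_t kNumShards = 64;

  // Aligned to 64 bytes (an assumed cache-line size) so that neighboring
  // shard mutexes are less likely to share a cache line.
  struct alignas(64) Shard {
    std::mutex mutex;
    std::unordered_map<void*, Block*> blocks;
  };

  // Maps a pointer to the shard that owns it.
  Shard& shard_for(void* ptr) {
    return shards_[std::hash<void*>{}(ptr) % kNumShards];
  }

  std::array<Shard, kNumShards> shards_;
};
```

With this layout, threads serving different GPUs contend only when their pointers hash to the same shard, rather than on every lookup.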
Test Plan: existing unit tests; verified on prod models using eight GPUs, showing double-digit improvements.
Differential Revision: D52493091
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118550
Approved by: https://github.com/albanD