Lock optimizations for DistAutogradContainer. (#36529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36529
DistAutogradContainer is a singleton for the entire process and has a
single lock that protects access to a map keyed by context_id. Performance
profiling showed that this lock is a potential bottleneck for training. As a
result, this PR makes the following optimizations:
1) Shard the map into 256 buckets, each with its own lock. This ensures we
hold much finer-grained locks.
2) sendReleaseContextRpc was previously called while holding the lock; it is
now invoked outside the lock.
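
The two optimizations above can be sketched as follows. This is an illustrative example only, assuming a simple map of context_id to payload; the names (ShardedContextMap, afterRelease) are hypothetical and not the actual DistAutogradContainer API:

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

// Illustrative sketch of a map sharded into fixed buckets, each guarded
// by its own lock, with cleanup work deferred until after the lock is
// released. Names are hypothetical, not the real DistAutogradContainer.
constexpr size_t kNumShards = 256;

class ShardedContextMap {
 public:
  struct Shard {
    std::mutex lock;
    std::unordered_map<int64_t, int64_t> contexts;  // context_id -> payload
  };

  // Pick a shard from the low bits of the context id; sequentially
  // assigned ids spread across shards.
  Shard& getShard(int64_t contextId) {
    return shards_[static_cast<size_t>(contextId) % kNumShards];
  }

  void put(int64_t contextId, int64_t value) {
    Shard& shard = getShard(contextId);
    std::lock_guard<std::mutex> guard(shard.lock);
    shard.contexts[contextId] = value;
  }

  bool get(int64_t contextId, int64_t& out) {
    Shard& shard = getShard(contextId);
    std::lock_guard<std::mutex> guard(shard.lock);
    auto it = shard.contexts.find(contextId);
    if (it == shard.contexts.end()) {
      return false;
    }
    out = it->second;
    return true;
  }

  // Erase under the shard lock, but run follow-up work (e.g. an RPC
  // like sendReleaseContextRpc) only after the lock is dropped,
  // mirroring optimization (2).
  bool erase(int64_t contextId,
             const std::function<void(int64_t)>& afterRelease) {
    bool erased = false;
    {
      Shard& shard = getShard(contextId);
      std::lock_guard<std::mutex> guard(shard.lock);
      erased = shard.contexts.erase(contextId) > 0;
    }  // shard lock released here
    if (erased) {
      afterRelease(contextId);  // runs outside the lock
    }
    return erased;
  }

 private:
  std::array<Shard, kNumShards> shards_;
};
```

With 256 independent locks, two threads operating on different context ids rarely contend, and the erase path never holds a lock across the (potentially slow) release callback.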
ghstack-source-id: 102085139
Test Plan: waitforbuildbot
Differential Revision: D21003934
fbshipit-source-id: 55f80dd317311bce0efd3ca8ca617d071297b5dc