CUDACachingAllocator: Keep one event queue per stream (#71745)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71616
This fixes the leaks in my test case. I have not tested it on our big models yet, but will report back once we do.
This potentially impacts allocator performance: it slightly increases the amount of CPU memory we allocate for bookkeeping, and `process_events` may examine a larger number of events when multiple streams have long-running ops on them.
However, I suspect that in general, either:
- An application isn't using very many streams or very many long-running ops, in which case the performance is essentially the same
- Or, they are, which is precisely the case where https://github.com/pytorch/pytorch/issues/71616 bites you, and so freeing memory faster is probably more valuable than the slight CPU overhead here.
I'm not attached to this approach or any of its details, but figured it was worth throwing up for discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71745
Reviewed By: soulitzer
Differential Revision: D33948288
Pulled By: ngimel
fbshipit-source-id: 73e95f8a9bbe385a77de483d1c58b857b5d84e81
(cherry picked from commit d233719c072341607e6dab226b5cbfe8d316d91f)