Use a single stream for cuda graph pool (#97419)
Previously, we would use the same memory pool but not actually reuse the same memory. Peak memory statistics looked good, but real memory use was much higher because we had a bunch of unallocated segments that could not be reused across streams.
As stated in comments:
NB: the CUDA caching allocator will remember the stream a segment is allocated to
and only allocate that segment to the same stream. We need to use a single stream
for all allocations to the memory pool; otherwise, the allocations to separate streams
will not be reused. Separate recordings would have used the same memory pool, but not
the same memory.
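To illustrate the pattern described above (a sketch, not the PR's actual change): two graph captures can share one memory pool via `torch.cuda.graph_pool_handle()`, and because the caching allocator ties each segment to the stream it was allocated on, both captures must run on the same side stream for the second capture to reuse the first capture's segments. The tensor shapes and stream setup here are illustrative assumptions.

```python
import torch

def capture_two_graphs_sharing_memory():
    # One shared pool id and, crucially, ONE stream for all allocations
    # into that pool; allocating on separate streams would leave segments
    # pinned to their original stream and unreusable by the other capture.
    pool = torch.cuda.graph_pool_handle()
    stream = torch.cuda.Stream()

    g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
    x = torch.ones(4, device="cuda")  # illustrative input

    with torch.cuda.graph(g1, pool=pool, stream=stream):
        y1 = x * 2
    # Second recording on the SAME stream and pool can reuse g1's segments.
    with torch.cuda.graph(g2, pool=pool, stream=stream):
        y2 = x + 1
    return g1, g2

if torch.cuda.is_available():
    capture_two_graphs_sharing_memory()
```

With the pre-fix behavior, each recording's allocations would land on a different stream, so the pool's free segments could not satisfy the next recording's requests even though they shared the same pool id.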
Thanks to @zdevito for help debugging this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97419
Approved by: https://github.com/ngimel