[CUDA][CUDA Graphs] Fix potential race with autograd thread during a graph capture 2 (#106570)
An alternative to #106235 that adds our own uid generation so that we can call `beginAllocateStreamToPool` (which notifies the caching allocator that a capture is starting) before actually starting the capture. Note that this does appear to change the behavior of uid generation slightly relative to the CUDA API call: the CUDA-generated id seems to increment by 3 each time, whereas ours increments by 1.
Looking at the changes again, I'm not sure whether the _begin_ capture ordering change is needed in addition to the _end_ capture ordering change, but leaving it out makes me uneasy, as I'm not sure anything prevents the autograd thread from running cleanup code "in-between" captures.
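
For illustration only, the _begin_ ordering described above can be sketched roughly as follows. This is a minimal mock under stated assumptions, not the code in this PR: `MockAllocator`, `MockGraph`, `next_capture_uid`, and the event strings are hypothetical names, and the real CUDA capture call is replaced by a logged event so the sketch compiles without a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the caching allocator. It only records the
// order in which it is told about events.
struct MockAllocator {
  std::vector<std::string> events;

  // Assumed analogue of notifying the allocator that a capture is starting.
  void begin_allocate_to_pool(std::uint64_t capture_id) {
    events.push_back("allocator_notified:" + std::to_string(capture_id));
  }
};

// Hypothetical process-wide uid generator. It increments by 1 per capture,
// so the id is known before any capture has actually begun.
inline std::uint64_t next_capture_uid() {
  static std::atomic<std::uint64_t> counter{1};
  return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical graph object showing only the begin-side ordering.
struct MockGraph {
  MockAllocator& alloc;
  std::uint64_t capture_id = 0;

  explicit MockGraph(MockAllocator& a) : alloc(a) {}

  void capture_begin() {
    // 1. Generate the id ourselves instead of asking the driver for it.
    capture_id = next_capture_uid();
    // 2. Tell the allocator first, so it already knows about the capture.
    alloc.begin_allocate_to_pool(capture_id);
    // 3. Only now start the capture (mocked here as a logged event).
    alloc.events.push_back("capture_started:" + std::to_string(capture_id));
  }
};
```

The point of the sketch is just that, once the id no longer comes from the driver, the allocator notification can be ordered before the capture starts.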
CC @zdevito @eellison
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106570
Approved by: https://github.com/zdevito