[CUDA Graphs] Fix OOM inside graph capture_begin
release_cached_blocks calls synchronize_and_free_events, which asserts that no graph capture is underway:
```
void synchronize_and_free_events() {
  TORCH_INTERNAL_ASSERT(captures_underway == 0);
  // ...
}
```
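This means the function cannot be called while a CUDA graph is being captured. An allocation that runs out of memory during capture still reaches it through release_cached_blocks: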
```
import torch

with torch.cuda.graph(torch.cuda.CUDAGraph()):
    # 2**40 float32 elements (4 TiB) far exceeds device memory, so this allocation fails.
    torch.zeros(2 ** 40, device="cuda")
```
results in:
```
RuntimeError: captures_underway == 0INTERNAL ASSERT FAILED at "/tmp/torch/c10/cuda/CUDACachingAllocator.cpp":1224, please report a bug to PyTorch.
```
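The failure mode above can be modeled without a GPU. The following is a minimal sketch, not PyTorch code: ToyAllocator, its capacity field, and the guard flag are all hypothetical names introduced for illustration, and the guard shown is only one plausible way to avoid the assert, not necessarily what this PR does.
```python
class ToyAllocator:
    """Hypothetical stand-in for the caching allocator (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.captures_underway = 0

    def synchronize_and_free_events(self):
        # Mirrors the assert quoted above: syncing is not allowed during capture.
        if self.captures_underway != 0:
            raise AssertionError("captures_underway == 0")

    def release_cached_blocks(self):
        self.synchronize_and_free_events()

    def malloc(self, size, guard):
        if size <= self.capacity:
            return size
        # Allocation failed. Without the guard, the retry path frees cached
        # blocks unconditionally and trips the assert during capture.
        if not (guard and self.captures_underway > 0):
            self.release_cached_blocks()
        raise MemoryError("out of memory")


def outcome(guard):
    alloc = ToyAllocator(capacity=1024)
    alloc.captures_underway = 1  # pretend a graph capture is in progress
    try:
        alloc.malloc(2 ** 40, guard=guard)
    except AssertionError:
        return "internal assert"
    except MemoryError:
        return "out of memory"
```
With guard=False the toy reproduces the internal assert during capture. With guard=True the same request surfaces as an ordinary out-of-memory error.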
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76247
Approved by: https://github.com/ngimel