make autocast cache global instead of thread-local (#86492)
Summary:
There is a memory leak because the autocast cache is thread-local:
`torch.clear_autocast_cache()` clears only the calling (main) thread's
cache, while autograd can write to its own per-thread copy from a
background engine thread, so whatever autograd writes is never cleared
and leaks. A sketch of the pattern is below.
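A hypothetical illustration of the leaking pattern (assumes a CUDA build; `torch.utils.checkpoint` is one plausible way to get the autograd engine thread to re-run an autocast region, since reentrant checkpointing restores the autocast state during recomputation):
```
import torch
from torch.utils.checkpoint import checkpoint

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda", requires_grad=True)

with torch.autocast("cuda"):
    # The recomputed forward runs inside backward, on an autograd engine
    # worker thread, so the memoized weight casts can be written from there.
    out = checkpoint(model, x)
    out.sum().backward()

# Before this change the cache was thread-local, so this cleared only the
# main thread's entries; anything the engine thread cached leaked.
torch.clear_autocast_cache()
```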
After some offline discussion we decided that a single global cache,
guarded by a lock, is a practical way to deal with this, and the
performance impact of the lock should be negligible; a minimal Python
analogue is sketched below.
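The real cache lives in C++ (the names below are made up for illustration), but the shape of the change is roughly this: one process-wide map plus a lock, rather than one map per thread, so a clear issued from any thread also removes entries written by the autograd engine's worker threads:
```
import threading

_cache_lock = threading.Lock()
_cast_cache = {}  # source-tensor key -> cached low-precision cast

def cached_cast(key, make_cast):
    # The lock is held only for the lookup/insert, so contention is minimal.
    with _cache_lock:
        if key not in _cast_cache:
            _cast_cache[key] = make_cast()
        return _cast_cache[key]

def clear_autocast_cache():
    with _cache_lock:
        _cast_cache.clear()  # now visible to, and clears for, every thread
```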
Test Plan:
I don't have a local repro of the original issue; I still need to look
into how to get one.
A toy example
(https://gist.github.com/vkuzo/0d6318fe7f7cb1c505e370cd5c1a643b)
shows cache clearing working as expected on the forward and backward passes.
Local testing:
```
python test/test_cuda.py -k autocast
python test/test_autocast.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86492
Approved by: https://github.com/ezyang