[CUDA graphs] Changes batchnorm to increment num_batches_tracked in place for improved graph safety (#70444)
Summary:
This PR was not my worst debugging annoyance, nor my smallest in lines changed, but it has the highest `debugging annoyance/lines changed` ratio.
The current pattern:
```
self.num_batches_tracked = self.num_batches_tracked + 1
```
If this pattern is captured in a CUDA graph, the out-of-place add deletes the eagerly-allocated tensor and rebinds `self.num_batches_tracked` to a fresh tensor allocated during capture. Replays then read from the (deallocated) original tensor's address.
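For illustration, here's a minimal standalone sketch of the same hazard outside of batchnorm (assumes a CUDA-capable build; `counter` is a hypothetical stand-in for `num_batches_tracked`):
```
import torch

counter = torch.zeros(1, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Rebinds `counter` to an output allocated in the graph's private
    # memory pool; the recorded add kernel still reads the *original*
    # tensor's address, and the original loses its last reference here.
    counter = counter + 1

# Each replay re-reads the original (freed) address and writes
# original + 1 into the pool tensor, so the count never advances;
# depending on what reuses that memory, the result may instead be
# garbage or an illegal memory access.
g.replay()
g.replay()
print(counter)
```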
This can cause:
1. an IMA (illegal memory access) on graph replay
2. failure to actually increment `num_batches_tracked` across replays, because every replay re-reads the stale value from the old location instead of accumulating
3. numerical corruption if the allocator reassigns the original tensor's memory to some unrelated tensor
4. combinations of 1, 2, and 3, depending on global allocation patterns and on if/when the BN module is sometimes called eagerly between replays
(ask me how I know).
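The fix is the pattern in the title: mutate the tensor in place so the address baked into the graph remains the live tensor. Same sketch, graph-safe (again assuming a CUDA build):
```
import torch

counter = torch.zeros(1, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # In place: no new allocation, no rebinding. The recorded kernel
    # reads and writes the original tensor's address, which stays alive.
    counter.add_(1)

# Capture records but does not execute, so two replays leave the
# counter at exactly 2.
g.replay()
g.replay()
print(counter)  # tensor([2.], device='cuda:0')
```
In batchnorm terms, the rebinding assignment becomes an in-place increment along the lines of `self.num_batches_tracked.add_(1)`.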
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70444
Reviewed By: albanD
Differential Revision: D33342203
Pulled By: ngimel
fbshipit-source-id: 5f201cc25030517e75af010bbaa88c452155df21