[PyTorch] Support NVTX range_start and range_end (#70030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70030
range_push and range_pop are not multi-thread safe: they only work when the push and the pop happen on the same thread.
For process-level ranges, we should use range_start and range_end instead. This is important because PyTorch's forward pass runs on one thread while autograd runs on a different thread.
See NVIDIA's implementation documentation:
https://github.com/nvpro-samples/shared_external/blob/cab2dec7608ebc9d36fb086a07ce5112700b089d/NSight/nvToolsExt.h#L397-L407
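The threading difference can be sketched in pure Python. This is a hypothetical stand-in (the `RangeTracker` class below is not NVTX or the PyTorch implementation): `range_push`/`range_pop` operate on a thread-local stack, so a pop on another thread fails, while `range_start` returns an id that `range_end` can consume from any thread.

```python
# Conceptual sketch only: RangeTracker is a made-up class illustrating why
# id-based start/end works across threads while a push/pop stack does not.
import itertools
import threading

class RangeTracker:
    def __init__(self):
        self._ids = itertools.count()
        self._open = {}                  # id -> message, shared across threads
        self._lock = threading.Lock()
        self._stack = threading.local()  # per-thread stack for push/pop

    # push/pop: the stack is thread-local, so popping on another thread fails
    def range_push(self, msg):
        frames = getattr(self._stack, "frames", None)
        if frames is None:
            frames = self._stack.frames = []
        frames.append(msg)

    def range_pop(self):
        frames = getattr(self._stack, "frames", [])
        if not frames:
            raise RuntimeError("pop on a thread with no open pushed range")
        return frames.pop()

    # start/end: the returned id is valid on any thread
    def range_start(self, msg):
        with self._lock:
            rid = next(self._ids)
            self._open[rid] = msg
            return rid

    def range_end(self, rid):
        with self._lock:
            return self._open.pop(rid)

tracker = RangeTracker()
rid = tracker.range_start("forward+backward")  # e.g. started on the forward thread

result = {}
def autograd_thread():
    # Ending by id succeeds even though the range was started elsewhere.
    result["msg"] = tracker.range_end(rid)
    try:
        tracker.range_pop()  # no range was pushed on this thread
    except RuntimeError:
        result["pop_failed"] = True

t = threading.Thread(target=autograd_thread)
t.start()
t.join()
print(result)  # {'msg': 'forward+backward', 'pop_failed': True}
```

This mirrors the intent of the new bindings added here: start a range on the forward thread, pass the returned id across, and end it from the autograd thread.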
Test Plan:
```
buck test caffe2/test:cuda
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460
✓ ListingSuccess: caffe2/test:cuda - main (19.640)
Summary
ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/8162774391483460
```
Reviewed By: malfet
Differential Revision: D33155244
fbshipit-source-id: c7d5143f6da9b6ef0e0811e2fcae03a3e76f24de
(cherry picked from commit 22134e91b7580730c6a47d23790f75acb9c1fd86)