Pool cudaEvents in CUDACachingAllocator (#78279)
Summary:
`cudaEventCreate`/`cudaEventDestroy` can be expensive, especially when the process is also making many other CUDA API calls.
Pool the `cudaEvent_t` objects so that each one is created once and reused as much as possible.
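
The pooling idea can be sketched as below. This is a minimal illustration, not the code in this diff: `FakeEvent`, `create_count`, and `EventPool` are hypothetical names, and `FakeEvent` stands in for `cudaEvent_t` so the sketch compiles without the CUDA toolkit. It assumes the pool outlives every event handed out.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for cudaEvent_t (assumption: no CUDA needed to build).
struct FakeEvent {
  int id;
};

// Counts how often the "expensive" creation path runs. In real code this
// path would be a call such as cudaEventCreateWithFlags.
inline int& create_count() {
  static int n = 0;
  return n;
}

class EventPool {
 public:
  // The deleter returns the event to the pool instead of destroying it.
  using Event = std::unique_ptr<FakeEvent, std::function<void(FakeEvent*)>>;

  Event get() {
    // Called when an Event handle goes out of scope: recycle, do not destroy.
    auto recycle = [this](FakeEvent* e) {
      std::lock_guard<std::mutex> guard(mutex_);
      free_.emplace_back(e);
    };
    {
      std::lock_guard<std::mutex> guard(mutex_);
      if (!free_.empty()) {
        // Fast path: reuse an idle event, no creation cost.
        FakeEvent* e = free_.back().release();
        free_.pop_back();
        return Event(e, recycle);
      }
    }
    // Slow path: the pool is empty, so pay the creation cost once.
    auto* e = new FakeEvent{create_count()++};
    return Event(e, recycle);
  }

  // Number of events currently idle in the pool.
  std::size_t idle() {
    std::lock_guard<std::mutex> guard(mutex_);
    return free_.size();
  }

 private:
  std::mutex mutex_;
  // Idle events; they are actually destroyed only when the pool is destroyed.
  std::vector<std::unique_ptr<FakeEvent>> free_;
};
```

With this shape, callers hold an `EventPool::Event` like any other `std::unique_ptr`. Dropping the handle puts the event back in the pool, so steady-state use performs no create or destroy calls.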
Test Plan:
Unit tests cover the functionality.
Manual performance testing shows that this diff is perf positive. Measured times for the affected functions, in microseconds (us):
| | create_event_internal (us) | free_event_internal/destructor (us) | insert_events (us) | process_events (us) |
|---|---|---|---|---|
| baseline | 2.411 | 2.647 | 3.968 | 0.321 |
| this diff | 0.115 | 0.147 | 2.846 | 0.262 |
| speed up | 20.9x | 18.0x | 1.4x | 1.2x |
Differential Revision: D35729059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78279
Approved by: https://github.com/jianyuh