Introduce generic MultiStreamGuard (#57049)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57049
A comment above CUDAMultiStreamGuard said "TODO: Implement this generically in c10". This change does that.
The new generic MultiStreamGuard class takes a vector of device-agnostic c10::Streams and supports any device type (CUDA, but also ROCm and others) by using a VirtualGuardImpl. CUDAMultiStreamGuard is kept for convenience, and for a slight performance benefit, since it avoids a vtable lookup.
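For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a multi-stream RAII guard that dispatches through a virtual interface. It is not the c10 code from this PR. Device, Stream, GuardImplInterface, ToyGuardImpl, currentStreams and currentStreamId are hypothetical stand-ins invented for illustration, and the per-device table is a toy replacement for real backend state.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins for the c10 device and stream types (assumption).
enum class DeviceType { CPU, CUDA, HIP };

struct Device {
  DeviceType type;
  int index;
  // Ordering so that Device can be used as a std::map key.
  bool operator<(const Device& o) const {
    return type != o.type ? type < o.type : index < o.index;
  }
};

struct Stream {
  Device device;
  std::int64_t id;
};

// Toy "current stream per device" table; stream id 0 plays the role of
// the default stream. A real backend keeps this state itself.
inline std::map<Device, std::int64_t>& currentStreams() {
  static std::map<Device, std::int64_t> table;
  return table;
}

inline std::int64_t currentStreamId(Device d) {
  return currentStreams()[d];
}

// Device-agnostic interface; calls go through a vtable, which is the
// cost a device-specific guard can avoid.
struct GuardImplInterface {
  virtual ~GuardImplInterface() = default;
  // Makes `s` current on its device and returns the previous stream.
  virtual Stream exchangeStream(Stream s) const = 0;
};

struct ToyGuardImpl final : GuardImplInterface {
  Stream exchangeStream(Stream s) const override {
    assert(s.device.index >= 0);  // toy sanity check
    auto& table = currentStreams();
    Stream previous{s.device, table[s.device]};
    table[s.device] = s.id;
    return previous;
  }
};

// RAII guard: makes every given stream current, restores on destruction.
class MultiStreamGuard {
 public:
  MultiStreamGuard(const GuardImplInterface& impl,
                   const std::vector<Stream>& streams)
      : impl_(impl) {
    originals_.reserve(streams.size());
    for (const Stream& s : streams) {
      originals_.push_back(impl_.exchangeStream(s));
    }
  }

  ~MultiStreamGuard() {
    // Restore in reverse order so repeated devices unwind correctly.
    for (auto it = originals_.rbegin(); it != originals_.rend(); ++it) {
      impl_.exchangeStream(*it);
    }
  }

  MultiStreamGuard(const MultiStreamGuard&) = delete;
  MultiStreamGuard& operator=(const MultiStreamGuard&) = delete;

 private:
  const GuardImplInterface& impl_;
  std::vector<Stream> originals_;
};
```

In this sketch, constructing a MultiStreamGuard with one stream on a CUDA device and one on a HIP device makes both current for the enclosing scope. When the guard goes out of scope, each device's previous stream is restored.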
ghstack-source-id: 127713139
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28029158
fbshipit-source-id: 2f3181371f8cb0d77a3b2e6aa510f1dd74e8f69b