reduce overhead of get_current_stream (#78066)
This reduces the overhead of `torch.cuda.current_stream()` from a ridiculous 8.7 us to a still-ridiculous 4.4 us per call.
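For anyone who wants to check the per-call cost locally, here is a minimal sketch of one way to time it. This is not necessarily how the numbers above were obtained: the helper names `per_call_us` and `measure_current_stream` are hypothetical, and the result will vary with hardware and build.

```python
import timeit


def per_call_us(fn, number=100_000, repeat=5):
    """Best-of-`repeat` mean wall time per call of `fn`, in microseconds."""
    best = min(timeit.repeat(fn, number=number, repeat=repeat))
    return best / number * 1e6


def measure_current_stream():
    """Time torch.cuda.current_stream(); return None if torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Call once up front so lazy CUDA initialization is not included in the timing.
    torch.cuda.current_stream()
    return per_call_us(torch.cuda.current_stream)


if __name__ == "__main__":
    result = measure_current_stream()
    if result is None:
        print("torch with CUDA not available; nothing measured")
    else:
        print(f"torch.cuda.current_stream(): {result:.2f} us per call")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes. The figure includes Python call overhead, which is the cost this change targets.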
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78066
Approved by: https://github.com/mruberry
Author: Natalia Gimelshein