Use _exchange_device to reduce torch.cuda.device overhead (#91127)
This change had to wait out the forward compatibility period because it
requires the `cuda::_exchange_device` primitive in TorchScript. Also, since
TorchScript doesn't support inheritance, we can't just inherit from
`_DeviceGuard` here.
This saves around 2 µs per `with torch.cuda.device(...)` statement.
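As a rough illustration of the pattern (not the actual PyTorch code), the sketch below compares a get/set style guard with an exchange-based guard. It runs against a hypothetical `FakeRuntime` that only tracks the current device and counts calls. All class and method names are made up for the example, and treating a negative index as a no-op is an assumption of the sketch.

```python
class FakeRuntime:
    """Hypothetical stand-in for a device runtime; it only counts calls."""

    def __init__(self):
        self.current = 0
        self.calls = 0

    def get_device(self):
        self.calls += 1
        return self.current

    def set_device(self, idx):
        self.calls += 1
        self.current = idx

    def exchange_device(self, idx):
        # Single call: switch to idx and return the previous device.
        # Assumption of this sketch: a negative idx is a no-op returning -1.
        self.calls += 1
        if idx < 0:
            return -1
        prev, self.current = self.current, idx
        return prev


class GetSetGuard:
    """Guard built from separate get and set calls."""

    def __init__(self, rt, idx):
        self.rt, self.idx, self.prev = rt, idx, -1

    def __enter__(self):
        self.prev = self.rt.get_device()
        if self.prev != self.idx:
            self.rt.set_device(self.idx)

    def __exit__(self, exc_type, exc, tb):
        if self.prev != self.idx:
            self.rt.set_device(self.prev)
        return False


class ExchangeGuard:
    """Guard built from a single exchange call on entry and on exit."""

    def __init__(self, rt, idx):
        self.rt, self.idx, self.prev = rt, idx, -1

    def __enter__(self):
        self.prev = self.rt.exchange_device(self.idx)

    def __exit__(self, exc_type, exc, tb):
        self.rt.exchange_device(self.prev)
        return False


def count_calls(guard_cls, target):
    """Run one `with` block and report how many runtime calls it made."""
    rt = FakeRuntime()
    with guard_cls(rt, target):
        assert rt.current == target
    assert rt.current == 0  # original device restored on exit
    return rt.calls
```

In this toy model, `count_calls(ExchangeGuard, 1)` makes two runtime calls per `with` block and `count_calls(GetSetGuard, 1)` makes three. The real saving depends on the per-call binding overhead, which the sketch does not measure.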
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91127
Approved by: https://github.com/ngimel