[Tensorexpr] Fix and improve handling of multiple GPU devices (#38365)
Summary:
These commits fix a bug that was exposed when we took away the fallback path. The fix is to set the appropriate device before setting the CUDA stream.
The improvement is that, when compiling, the device is switched to the new device only if it differs from the prior device, and a redundant call to cudaFree is removed.
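The ordering described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `FakeRuntime`, `compileFor`, and `launchOn` are hypothetical names, and the fake runtime stands in for the CUDA runtime so the sketch runs without a GPU. It assumes that a stream lookup resolves against whichever device is current, which is why the device has to be selected first.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the CUDA runtime (assumption, not a real API).
struct FakeRuntime {
  int current_device = 0;
  int set_device_calls = 0;
  std::vector<int> launches;  // device on which each launch was issued

  void setDevice(int d) { current_device = d; ++set_device_calls; }
  int getDevice() const { return current_device; }
  // Assumption: the stream returned belongs to the current device.
  int currentStreamDevice() const { return current_device; }
};

// Compile step: switch devices only when the target differs from the prior one.
void compileFor(FakeRuntime& rt, int target_device) {
  assert(target_device >= 0);
  const int prior_device = rt.getDevice();
  if (prior_device != target_device) {
    rt.setDevice(target_device);
  }
  // Code generation and module loading would happen here.
}

// Launch step: select the device first, then look up the stream.
void launchOn(FakeRuntime& rt, int target_device) {
  assert(target_device >= 0);
  if (rt.getDevice() != target_device) {
    rt.setDevice(target_device);
  }
  const int stream_device = rt.currentStreamDevice();
  rt.launches.push_back(stream_device);
}
```

If the stream were looked up before the device switch, the launch in this sketch would be recorded against the previously current device, which is the kind of mismatch the fix avoids.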
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38365
Reviewed By: zheng-xq
Differential Revision: D21537469
Pulled By: protonu
fbshipit-source-id: b9662dd623b5c7cfd23eb6894e992a43665641e4