[JIT] Initialize CUDA context before launching fused kernel (#65064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65064
The problem appears when nvfuser is triggered from LazyTensor (LT).
LT maintains its own thread pool. The thread that performs the
first-time compilation initializes the CUDA context properly, but a
later cached execution may run on a different thread that has no
current CUDA context. This change initializes the CUDA context before
launching the fused kernel, so execution no longer depends on which
thread did the compilation.
Test Plan: Imported from OSS
Reviewed By: saketh-are
Differential Revision: D31269691
Pulled By: desertfire
fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2