Avoid CUDA reinit error in CI tests (#7977)
The full CI test currently fails with a [CUDA reinit
error](https://github.com/deepspeedai/DeepSpeed/actions/runs/24444633640/job/71417719445).
This PR includes the following fixes:
- Fix `compute_capability_args()` in JIT mode to read
`TORCH_CUDA_ARCH_LIST` before calling
`torch.cuda.get_device_capability()`, and restore the JIT builder state
after `jit_load()`. Also add regression tests for the explicit-arch,
bad-fork, and restore paths.
- Delay initialization of CUDA streams in DeepCompile.
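
The two ideas above can be sketched as follows. This is a minimal illustration, not the code in this PR: `resolve_archs` and `LazyResource` are hypothetical names, and `query_device` stands in for `torch.cuda.get_device_capability`. The point is that a parent process that never touches CUDA cannot trigger PyTorch's "cannot re-initialize CUDA in forked subprocess" error in its children.

```python
import os
from typing import Callable, List, Mapping, Optional, Tuple


def resolve_archs(
    query_device: Callable[[], Tuple[int, int]],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return compute capabilities such as ["8.0", "9.0"].

    Hypothetical helper: prefers TORCH_CUDA_ARCH_LIST and only calls
    query_device (which would initialize CUDA) when the variable is
    unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if raw:
        # Assumes numeric entries separated by ';' or spaces, with an
        # optional "+PTX" suffix; named architectures are not handled.
        return [a.replace("+PTX", "") for a in raw.replace(";", " ").split()]
    major, minor = query_device()
    return [f"{major}.{minor}"]


class LazyResource:
    """Create a resource on first use instead of at construction time."""

    def __init__(self, factory: Callable[[], object]) -> None:
        self._factory = factory
        self._value = None

    def get(self):
        # The factory (e.g. a CUDA stream constructor) runs at most once,
        # and only when the resource is actually needed.
        if self._value is None:
            self._value = self._factory()
        return self._value
```

In real code, `query_device` would presumably be `torch.cuda.get_device_capability` and the factory something like `torch.cuda.Stream`; the actual DeepSpeed changes may be structured differently.
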
After this fix, the full test
[passed](https://github.com/deepspeedai/DeepSpeed/actions/runs/24508304055/job/71632434455)
again.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>