[bazel] enable sccache+nvcc in CI (#95528)
Fixes #79348
This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.
Along the way we had to make a couple of tweaks:
1. Split rules_cc out of rules_cuda, which previously embedded it. This is needed in order to apply a different patch to rules_cc than the one rules_cuda applies by default. That in turn is needed to work around an nvcc behavior where it does not forward `-iquote xxx` to the host compiler, but does forward `-isystem xxx`. So we work around the problem by (ab)using `-isystem` instead. Without it we get errors like `xxx` is not found.
2. Work around a bazel bug, https://github.com/bazelbuild/bazel/issues/10167, that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate an ad-hoc, bazel-specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This lets us work around the issue with CUDA symlinks. Without it we get `undeclared inclusion(s) in rule` errors all over the place for CUDA headers.
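The generated wrapper in point 2 can be sketched roughly as below. This is a simplified illustration, not the exact generated script; the `external/local_cuda/cuda/bin/nvcc` path is a hypothetical example of the relative bazel path to the local_cuda repo that the real wrapper hardcodes.

```shell
# Sketch: emit a bazel-specific nvcc wrapper that routes compilation
# through sccache (hypothetical paths, for illustration only).
cat > nvcc_wrapper.sh <<'EOF'
#!/bin/bash
# Invoke the real nvcc from bazel's local_cuda external repository via
# sccache, so that the header paths sccache observes match the paths
# bazel declares (avoiding the symlink-related "undeclared inclusion"
# errors described above).
exec sccache external/local_cuda/cuda/bin/nvcc "$@"
EOF
chmod +x nvcc_wrapper.sh
```

Because the relative path is baked in, the wrapper only works from the bazel execution root, which is exactly why it has to be generated per-workspace rather than shipped as a generic script.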
## Test plan
Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740
Note that the sccache output now includes "CUDA" entries:
```
+ sccache --show-stats
Compile requests 9784
Compile requests executed 6726
Cache hits 6200
Cache hits (C/C++) 6131
Cache hits (CUDA) 69
Cache misses 519
Cache misses (C/C++) 201
Cache misses (CUDA) 318
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 0
Cache errors 7
Cache errors (C/C++) 7
Non-cacheable compilations 0
Non-cacheable calls 2893
Non-compilation calls 165
Unsupported compiler calls 0
Average cache write 0.116 s
Average cache read miss 23.722 s
Average cache read hit 0.057 s
Failed distributed compilations 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95528
Approved by: https://github.com/huydhn