[Inductor max autotune] Multithreaded Precompilation (#119386)
When using the Cutlass backend, compiling the CUDA source files
can dominate the time spent on the benchmarking done
as part of autotuning.
This change adds a multithreaded precompilation phase, which pre-populates the compilation cache (both the in-memory cache and,
if configured, an on-disk sccache).
It also eliminates unnecessary compilation
and benchmarking steps that were previously performed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov