[inductor] reusing autotuning sub-processes (#97219)
The major cost of autotuning in a sub-process is process creation and initialization. Previously we paid that cost for every benchmark task. This PR reuses a child process for as long as it has not crashed, which improves compilation time a lot. It is still somewhat slower than single-process tuning, though. Here is how single-process and multi-process tuning compare:
- if a benchmark task would crash the process, then single-process tuning is a no-go
- if a benchmark task works fine, then tuning in a child process is slower. We will try to leverage multiple GPUs to speed this up further.
TLDR for compilation time: we reduce the slowdown relative to single-process tuning from 11x to 1.5x. We'll try to improve that further.
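To illustrate the reuse idea, here is a minimal sketch and not the code from this PR. It assumes a hypothetical `ReusableWorker` helper and a made-up line-based JSON protocol: one child process is kept alive across tasks and is replaced only after it has died.

```python
import json
import subprocess
import sys

# Source run by the child (hypothetical protocol): one JSON task per input
# line, one JSON result per output line.
_CHILD_SOURCE = r"""
import json, os, sys
for line in sys.stdin:
    task = json.loads(line)
    if task["op"] == "crash":
        os._exit(1)  # simulate a benchmark task that kills the process
    result = {"pid": os.getpid(), "value": sum(task["args"])}
    sys.stdout.write(json.dumps(result) + "\n")
    sys.stdout.flush()
"""


class ReusableWorker:
    """Hypothetical helper that keeps one child process alive across tasks."""

    def __init__(self):
        self._proc = None

    def _ensure_started(self):
        # Pay the process start-up cost only when there is no live child.
        if self._proc is None or self._proc.poll() is not None:
            self._proc = subprocess.Popen(
                [sys.executable, "-c", _CHILD_SOURCE],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                text=True,
            )

    def _discard(self):
        # Close both pipes and reap the child; ignore errors from a dead pipe.
        for stream in (self._proc.stdin, self._proc.stdout):
            try:
                stream.close()
            except OSError:
                pass
        self._proc.wait()
        self._proc = None

    def run(self, task):
        """Return the child's result dict, or None if the child crashed."""
        self._ensure_started()
        try:
            self._proc.stdin.write(json.dumps(task) + "\n")
            self._proc.stdin.flush()
            reply = self._proc.stdout.readline()
        except OSError:
            reply = ""
        if not reply:
            # EOF means the child died; the next call starts a fresh one.
            self._discard()
            return None
        return json.loads(reply)

    def close(self):
        # Closing stdin ends the child's read loop, so it exits on its own.
        if self._proc is not None:
            self._discard()
```

Calling `run` repeatedly goes to the same child until a task crashes it. The crashing task is reported as `None`, and the next `run` call transparently starts a replacement child.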
Here is the compilation time comparison:
Single process autotuning:
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_5 0.0307s 90.0%
triton_mm_plus_mm_1 0.0317s 87.1%
triton_mm_plus_mm_7 0.0379s 73.0%
ref_mm_plus_mm 0.0389s 71.1%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
SingleProcess AUTOTUNE takes 9.04686689376831 seconds
```
Naive multi-process tuning (child process not reused): 11x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0287s 100.0%
triton_mm_plus_mm_6 0.0287s 100.0%
triton_mm_plus_mm_1 0.0317s 90.3%
triton_mm_plus_mm_5 0.0317s 90.3%
triton_mm_plus_mm_7 0.0379s 75.7%
ref_mm_plus_mm 0.0389s 73.7%
triton_mm_plus_mm_2 0.0399s 71.8%
triton_mm_plus_mm_3 0.0399s 71.8%
triton_mm_plus_mm_4 0.0420s 68.3%
SubProcess AUTOTUNE takes 101.22216320037842 seconds
```
Multi-process tuning reusing the child process (this PR): 1.5x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
triton_mm_plus_mm_0 0.0276s 100.0%
triton_mm_plus_mm_6 0.0287s 96.4%
triton_mm_plus_mm_5 0.0307s 90.0%
triton_mm_plus_mm_1 0.0317s 87.1%
triton_mm_plus_mm_7 0.0379s 73.0%
ref_mm_plus_mm 0.0389s 71.1%
triton_mm_plus_mm_2 0.0399s 69.2%
triton_mm_plus_mm_3 0.0410s 67.5%
triton_mm_plus_mm_4 0.0410s 67.5%
SubProcess AUTOTUNE takes 13.752070665359497 seconds
```
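As a quick check, the slowdown factors quoted above follow from the wall-clock times in the three logs. The variable names below are illustrative only:

```python
# Wall-clock autotuning times (seconds) copied from the logs above.
single_process = 9.04686689376831
naive_subprocess = 101.22216320037842
reused_subprocess = 13.752070665359497

# Slowdown of each multi-process variant relative to single-process tuning.
naive_slowdown = naive_subprocess / single_process
reused_slowdown = reused_subprocess / single_process

print(f"naive: {naive_slowdown:.1f}x, reused: {reused_slowdown:.1f}x")
# prints "naive: 11.2x, reused: 1.5x"
```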
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97219
Approved by: https://github.com/ngimel