pytorch
3282030f - [inductor] reusing autotuning sub-processes (#97219)

Commit

1 year ago

[inductor] reusing autotuning sub-processes (#97219)

The major cost of autotuning in a sub-process is process creation and initialization. Previously we paid that cost for each benchmark task. This PR reuses a child process as long as it has not crashed. This improves compilation time a lot, although it is still a bit slower than single-process tuning.

Here is the comparison between single-process tuning and multi-process tuning:
- if a benchmark task would crash the process, then single-process tuning is a no-go
- if a benchmark task works fine, then tuning in a child process will be slower. We will try to leverage multiple GPUs to speed this up further.

TLDR for the compilation time: we reduce the 11x slowdown to 1.5x. We'll try to improve that further.

Here is the compilation time comparison:

Single-process autotuning:
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0307s 90.0%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_7 0.0379s 73.0%
  ref_mm_plus_mm 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
SingleProcess AUTOTUNE takes 9.04686689376831 seconds
```

Naive multi-process tuning (child process not reused): 11x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0287s 100.0%
  triton_mm_plus_mm_6 0.0287s 100.0%
  triton_mm_plus_mm_1 0.0317s 90.3%
  triton_mm_plus_mm_5 0.0317s 90.3%
  triton_mm_plus_mm_7 0.0379s 75.7%
  ref_mm_plus_mm 0.0389s 73.7%
  triton_mm_plus_mm_2 0.0399s 71.8%
  triton_mm_plus_mm_3 0.0399s 71.8%
  triton_mm_plus_mm_4 0.0420s 68.3%
SubProcess AUTOTUNE takes 101.22216320037842 seconds
```

Multi-process tuning reusing the child process (this PR): 1.5x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0307s 90.0%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_7 0.0379s 73.0%
  ref_mm_plus_mm 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
SubProcess AUTOTUNE takes 13.752070665359497 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97219
Approved by: https://github.com/ngimel
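The reuse strategy described in the message can be illustrated with a small, generic sketch. This is not the code from the PR: `ReusableChild`, `CHILD_SOURCE`, and the JSON line protocol are hypothetical names and choices made for illustration, and the "benchmark" is a stand-in that doubles a number. The sketch only shows the idea of keeping one child process alive across tasks and starting a new one after a crash.

```python
import json
import subprocess
import sys

# Hypothetical child program: reads one JSON task per line and answers with
# one JSON line. A task with "crash": true kills the child on purpose to
# imitate a benchmark that takes the process down.
CHILD_SOURCE = r"""
import json, os, sys
for line in sys.stdin:
    task = json.loads(line)
    if task.get("crash"):
        os._exit(1)
    reply = {"name": task["name"], "value": task["x"] * 2}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class ReusableChild:
    """Runs tasks in one long-lived child; respawns only after a crash."""

    def __init__(self):
        self.proc = None
        self.spawn_count = 0  # how many times the startup cost was paid

    def _ensure_alive(self):
        # Start a child only if there is none or the previous one has exited.
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(
                [sys.executable, "-c", CHILD_SOURCE],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                text=True,
            )
            self.spawn_count += 1

    def _discard(self):
        # Drop a dead or broken child so the next task gets a fresh one.
        if self.proc is not None:
            self.proc.kill()
            self.proc.wait()
            try:
                self.proc.stdin.close()
            except OSError:
                pass
            self.proc.stdout.close()
            self.proc = None

    def run(self, task):
        """Return the child's reply, or None if the task crashed the child."""
        self._ensure_alive()
        try:
            self.proc.stdin.write(json.dumps(task) + "\n")
            self.proc.stdin.flush()
            reply = self.proc.stdout.readline()
        except OSError:
            reply = ""
        if not reply:  # EOF means the child died while handling the task
            self._discard()
            return None
        return json.loads(reply)

    def close(self):
        if self.proc is not None:
            self.proc.stdin.close()  # EOF ends the child's read loop
            self.proc.wait()
            self.proc.stdout.close()
            self.proc = None


if __name__ == "__main__":
    child = ReusableChild()
    tasks = [
        {"name": "a", "x": 1},
        {"name": "b", "x": 2},
        {"name": "c", "crash": True},
        {"name": "d", "x": 4},
    ]
    for task in tasks:
        print(task["name"], child.run(task))
    print("processes started:", child.spawn_count)
    child.close()
```

In this sketch the four demo tasks need only two process starts: one at the beginning and one after the crashing task. A naive scheme would start a process per task, which is the overhead the message attributes the 11x slowdown to.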

Author

shunting314

Committer

pytorchmergebot

Parents

0b094ca3
