[torchdynamo] Use ProcessPoolExecutor for triton compiles (#87032)
This patch significantly improves the parallel compilation performance for compiling triton kernels
by using ProcessPoolExecutor to create a persistent pool of compilation workers.
Previously, os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces
the worker threads with a pool of processes that do the raw compilation, and does the serial work on the main thread
for everything else. This other work couldn't be parallelized anyway since it is mostly in Python.
In cold start situations, the time to get the workers started can be a significant portion of the total time.
This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo
gets to that point.
Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.
```
39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded
1m7.092s - cold, old setup n = 8 (its best config)
```
(cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel