[torchdynamo] Use ProcessPoolExecutor for triton compiles (#87032)
This patch significantly improves the parallel compilation performance for compiling triton kernels
by using ProcessPoolExecutor to create a persistent pool of compilation workers.
Previously, os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces
the worker threads with a pool of processes that do the raw compilation, and does the serial work on the main thread
for everything else. This other work couldn't be parallelized anyway since it is mostly in Python.
In cold start situations, the time to get the workers started can be a significant portion of the total time.
This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo
gets to that point.
Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.
```
39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded
1m7.092s - cold, old setup n = 8 (its best config)
```
(cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel