pytorch
9921b485 - Extend Inductor to support the third-party backend (#106874)


Commit

1 year ago

Extend Inductor to support the third-party backend (#106874)

## Summary

This is a re-land of https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency regression.

## Root Cause

For the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package slower because the `LoopLevel` class in that package is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` field of `LoopLevel`.

https://github.com/pytorch/pytorch/blob/c14cf312c9302be2cafd3be61484f5f9613fab63/torch/_inductor/codegen/cpp.py#L2883C53-L2883C53

The Triton backend does not need to touch this code path. However, we prefer to keep the code uniform, so the new design registers `CppScheduling` for CPU and `TritonScheduling` for Triton at the same time, regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` slightly ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the initialization of `simd_nelements` to `__post_init__`, which restores the compilation performance.
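The effect described in the root cause and the fix can be reproduced in isolation. The following is a minimal sketch, not the actual Inductor code: `pick_vec_isa_stub`, `EagerLoopLevel`, and `LazyLoopLevel` are hypothetical names, and the stub simply records each call instead of probing the hardware. It relies on the fact that a dataclass field default written as a function call is evaluated once, when the class body runs (that is, at import time), whereas `__post_init__` runs only when an instance is created.

```python
import dataclasses
from typing import Optional

# Records every call to the (hypothetical) expensive probe.
CALLS = []


def pick_vec_isa_stub() -> int:
    """Hypothetical stand-in for an expensive, one-shot ISA probe."""
    CALLS.append("probe")
    return 16  # made-up element count, for illustration only


@dataclasses.dataclass
class EagerLoopLevel:
    # The default is computed while the class body executes,
    # so merely defining (importing) the class pays for the probe.
    simd_nelements: int = pick_vec_isa_stub()


calls_after_eager_definition = len(CALLS)


@dataclasses.dataclass
class LazyLoopLevel:
    # Cheap placeholder default; the real value is filled in later.
    simd_nelements: Optional[int] = None

    def __post_init__(self) -> None:
        # The probe runs only when an instance is actually constructed
        # and the caller did not supply a value.
        if self.simd_nelements is None:
            self.simd_nelements = pick_vec_isa_stub()


calls_after_lazy_definition = len(CALLS)

level = LazyLoopLevel()
calls_after_lazy_instance = len(CALLS)

explicit = LazyLoopLevel(simd_nelements=8)
calls_after_explicit_instance = len(CALLS)
```

Defining `EagerLoopLevel` triggers the probe immediately, while defining `LazyLoopLevel` does not. A process that imports the module but never builds a loop level therefore avoids the cost entirely, which is the situation of the Triton backend described above.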
## Compilation Latency Performance Result

We ran a single model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`
- W/ PR #100706, the compilation latency is about **57~58**

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47**

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixed the compilation performance regression.

- W/ this PR #106874, the compilation latency is about **47~48**

```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
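To compare the three configurations directly, the benchmark rows above can be averaged per configuration. The sketch below is only an illustration: it parses the reported CSV rows with the standard `csv` module and computes the mean `compilation_latency` relative to the run without PR #100706. The dictionary labels and the helper name `mean_compilation_latency` are choices made here, not part of the benchmark tooling.

```python
import csv
import io
from statistics import mean

HEADER = (
    "dev,name,batch_size,speedup,abs_latency,compilation_latency,"
    "compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,"
    "unique_graphs,graph_breaks,unique_graph_breaks"
)

# Rows copied verbatim from the benchmark output reported above.
RUNS = {
    "with #100706": [
        "cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7",
        "cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7",
    ],
    "without #100706": [
        "cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7",
        "cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7",
    ],
    "with #106874": [
        "cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7",
        "cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7",
    ],
}


def mean_compilation_latency(rows):
    """Average the compilation_latency column over the given CSV rows."""
    reader = csv.DictReader(io.StringIO("\n".join([HEADER, *rows])))
    return mean(float(record["compilation_latency"]) for record in reader)


latencies = {label: mean_compilation_latency(rows) for label, rows in RUNS.items()}
baseline = latencies["without #100706"]

for label, value in latencies.items():
    change = (value / baseline - 1) * 100
    print(f"{label}: {value:.2f} ({change:+.1f}% vs. without #100706)")
```

With only two runs per configuration, the averages are indicative rather than statistically robust, but they show the re-landed change staying close to the baseline while the original PR does not.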

Author

EikanWang

Committer

pytorchmergebot

Parents

6c0bba3d
