pytorch
e168dbb9 - [inductor] improve cpp vec implementation of square (#96072)

Commit

1 year ago

[inductor] improve cpp vec implementation of square (#96072)

For cpp vectorization of `square`, the current implementation is not efficient. It also affects the performance of `batch normalization`, which uses `square` when calculating variance. This PR replaces the `power` with `multiplication` to gain more performance.

Micro-benchmark performance for eager vs. inductor, op=`aten.native_batch_norm.default`:

suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423

Model training performance for eager vs. inductor:

model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
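
To illustrate the idea behind the change, here is a minimal pure-Python sketch of a variance calculation that squares each deviation either through a general power call or through a plain multiplication. It is not the Inductor C++ code from this PR: the function names `variance_pow` and `variance_mul` are hypothetical, the variance is assumed to be the population variance (divide by N), and scalar Python does not reproduce the vectorized speedups reported above. It only shows that the two formulations compute the same result, which is what makes the substitution safe.

```python
import math


def variance_pow(xs):
    """Population variance, squaring each deviation with a general power call."""
    mean = sum(xs) / len(xs)
    return sum(math.pow(x - mean, 2.0) for x in xs) / len(xs)


def variance_mul(xs):
    """Population variance, squaring each deviation with one multiplication."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) * (x - mean) for x in xs) / len(xs)


# Illustrative input only; these values are not taken from the benchmarks.
data = [0.5, -1.25, 3.0, 2.75, -0.5]

if __name__ == "__main__":
    print(variance_pow(data), variance_mul(data))
```

A general power routine has to handle arbitrary exponents, while a square needs only a single multiply, so the multiplication form is the cheaper of the two equivalent expressions.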

Author

Valentine233

Committer

pytorchmergebot

Parents

bf01caf2
