[inductor] improve cpp vec implementation of square (#96072)
The current cpp vectorized implementation of `square` is inefficient because it goes through `power`. This also hurts the performance of `batch normalization`, which uses `square` when calculating variance. This PR replaces the `power` call with a `multiplication` to improve performance.
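As a rough illustration of the change described above, the sketch below contrasts squaring through a general power function with squaring through a single multiplication. It is a plain scalar loop, not the code that inductor generates; the function names `square_with_pow` and `square_with_mul` are hypothetical and exist only for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not inductor code): squares each element through a
// general power function, analogous to the old `power`-based lowering.
std::vector<float> square_with_pow(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::pow(x[i], 2.0f);
    }
    return out;
}

// Hypothetical helper (not inductor code): squares each element with one
// multiplication, analogous to the lowering this PR switches to.
std::vector<float> square_with_mul(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * x[i];
    }
    return out;
}
```

Both helpers produce the same values for these inputs; the multiplication form simply avoids the cost of a general-purpose power routine and maps directly onto a single multiply instruction.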
Micro-benchmark performance for eager vs. inductor:
op=`aten.native_batch_norm.default`
suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423
Model training performance for eager vs. inductor:
model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire