[cpu] implement erf based on oneDNN algorithm for aten::Vec (#91613)
ATen's `erf` implementation calls an `MKL` function, which performs better than TorchInductor's current `erf` implementation, which calls a `sleef` function in `aten::Vec`. The performance difference comes from the algorithm: `sleef` uses a Taylor expansion that is more precise than `MKL`'s approach and therefore takes longer to compute. Because the `erf` implementations in `oneDNN` and `MKL` are similar, this change implements `erf` in `aten::Vec` based on the `oneDNN` algorithm.
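For illustration, a scalar sketch of a polynomial-times-exponential `erf` approximation of the kind described above. It assumes the Abramowitz-Stegun 7.1.26 formula and its published coefficients, which this message does not itself list; the name `erf_approx` is hypothetical, and the real `aten::Vec` code operates on SIMD lanes rather than Python floats.

```python
import math

# Assumed coefficients: Abramowitz-Stegun formula 7.1.26 (not quoted in this
# commit message). Documented maximum absolute error is about 1.5e-7.
P = 0.3275911
A = (0.254829592, -0.284496736, 1.421413741, -1.453152027, 1.061405429)


def erf_approx(x: float) -> float:
    """Approximate erf(x) as sign(x) * (1 - poly(t) * exp(-x^2))."""
    sign = -1.0 if x < 0.0 else 1.0
    ax = abs(x)
    t = 1.0 / (1.0 + P * ax)

    # Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5.
    poly = 0.0
    for a in reversed(A):
        poly = (poly + a) * t

    return sign * (1.0 - poly * math.exp(-ax * ax))


if __name__ == "__main__":
    # Compare against the standard library on a few sample points.
    for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(v, erf_approx(v), math.erf(v))
```

The trade-off is the one the description names: a fixed low-degree polynomial plus a single `exp` is cheaper than a higher-precision expansion, at the cost of some accuracy.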
`gelu` also benefits from this change because it uses `erf`.
Performance data for eager vs. inductor (in the table, each `improved_ratio` value equals `speedup_new / speedup_old - 1`):

suite | op_name | improved_ratio_speedup0.2 | improved_ratio_speedup0.5 | improved_ratio_speedup0.8 | speedup_old_0.2 | RSD(3) | speedup_old_0.5 | RSD(3) | speedup_old_0.8 | RSD(3) | speedup_new_0.2 | RSD(3) | speedup_new_0.5 | RSD(3) | speedup_new_0.8 | RSD(3)
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | aten.erf.default | 138.54% | 138.54% | 138.54% | 0.402057897 | 13.54% | 0.402057897 | 13.54% | 0.402057897 | 13.54% | 0.959050302 | 4.21% | 0.959050302 | 4.21% | 0.959050302 | 4.21%
torchbench | aten.gelu.default | 196.94% | 16.28% | 3.28% | 0.303611506 | 0.88% | 0.865411422 | 0.23% | 0.984732108 | 0.15% | 0.901534389 | 1.04% | 1.006314977 | 0.10% | 1.017019831 | 0.37%
huggingface | aten.gelu.default | 178.90% | 153.93% | 22.70% | 0.324031619 | 8.16% | 0.40085369 | 1.67% | 0.839170801 | 1.30% | 0.90371451 | 2.25% | 1.017872459 | 0.47% | 1.029638829 | 0.49%
timm | aten.gelu.default | 12.76% | 3.01% | 1.98% | 0.892005539 | 0.22% | 0.979783341 | 0.16% | 0.998917466 | 0.08% | 1.005821648 | 0.11% | 1.009227094 | 0.07% | 1.018701655 | 0.30%
torchbench | aten.gelu_backward.default | 124.25% | 53.19% | 5.96% | 0.437150835 | 6.11% | 0.664341696 | 0.24% | 0.983091818 | 2.49% | 0.980304388 | 1.86% | 1.017688734 | 0.33% | 1.041684409 | 0.74%
huggingface | aten.gelu_backward.default | 126.26% | 32.55% | 11.61% | 0.446699743 | 0.34% | 0.781550075 | 0.73% | 0.989682073 | 0.28% | 1.010687581 | 1.31% | 1.035929929 | 1.11% | 1.104549968 | 2.68%
timm | aten.gelu_backward.default | 5.65% | 1.79% | 2.58% | 0.955116562 | 0.40% | 0.99782989 | 0.18% | 1.002408412 | 0.13% | 1.00905163 | 0.07% | 1.015649447 | 0.26% | 1.028238613 | 0.24%

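As a reminder of why the `gelu` and `gelu_backward` rows move together with `erf`, here is a minimal scalar sketch of the exact (erf-based) GELU and its derivative. The function names and the scalar form are illustrative assumptions; this is not the ATen or Inductor kernel code.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_backward(grad_out: float, x: float) -> float:
    """Gradient of exact GELU: grad_out * (Phi(x) + x * phi(x))."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return grad_out * (cdf + x * pdf)


if __name__ == "__main__":
    print(gelu(1.0), gelu_backward(1.0, 1.0))
```

Both the forward and the backward pass evaluate `erf` once per element, so a faster vectorized `erf` carries over directly to these ops.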
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91613
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/EikanWang, https://github.com/desertfire