Update ideep for NNC post-op (#82705)
### Description
This PR is to add NNC post-op fusion support in ideep for further NNC development. It includes:
- element wise post op fusion
- conv/matmal/linear + binary post op fusion
### Performance
**Common configuration:**
- Jemalloc and iomp enabled
- BS=1
- num_warmup = 300
- num_run = 500
- Average time of 1 iteration in ms is used
- time_before: no fusion
- time_after: with fusion
- Eltwise OPs selected: hardswish and abs
- Using oneDNN v2.6
**On ICX (32 cores per socket):
Conv2d FP32 (in channels Last format)**
| shape | time_(ms)_before | time_(ms)_after | Gain
-- | -- | -- | -- | --
1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.112174 | 0.071106 | 36.61%
1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.11269 | 0.070586 | 37.36%
1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.164219 | 0.129498 | 21.14%
1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.169371 | 0.1277 | 24.60%
| | | |
| shape | time_(ms)_before | time_(ms)_after | Gain
1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.994555 | 1.429813 | 28.31%
1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.715168 | 1.459937 | 14.88%
1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 2.997382 | 2.47915 | 17.29%
1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.044476 | 2.499366 | 17.90%
| | | |
| shape | time_(ms)_before | time_(ms)_after | Gain
4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.405204 | 0.38117 | 5.93%
4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.410145 | 0.389279 | 5.09%
4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.67917 | 0.662792 | 2.41%
4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.682302 | 0.671226 | 1.62%
**On CPX (28 cores per socket):
Conv2d BF16 (in channels Last format)**
| shape | time_(ms)_before | time_(ms)_after | Gain
-- | -- | -- | -- | --
1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.119289 | 0.091015 | 23.70%
1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.144116 | 0.09339 | 35.20%
1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.209975 | 0.177111 | 15.65%
1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.234777 | 0.179945 | 23.36%
| | | |
| shape | time_(ms)_before | time_(ms)_after | Gain
1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.296252 | 1.086423 | 16.19%
1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.364738 | 1.131289 | 17.11%
1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.99519 | 3.736147 | 6.48%
1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 4.03415 | 3.77981 | 6.30%
| | | |
| shape | time_(ms)_before | time_(ms)_after | Gain
4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.27474 | 0.245281 | 10.72%
4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.28595 | 0.254748 | 10.91%
4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.847318 | 0.791453 | 6.59%
4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.870212 | 0.801594 | 7.89%
**On CPX (28 cores per socket):
Linear BF16**
| shape | time_(ms)_before | time_(ms)_after | Gain
-- | -- | -- | -- | --
1socket | Linear+abs_N=1_iC=1024_oC=4096 | 0.043199 | 0.037603 | 12.95%
1socket | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.041845 | 0.038332 | 8.40%
1socket | Linear+abs_N=1_iC=4096_oC=1024 | 0.048282 | 0.044281 | 8.29%
1socket | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.048362 | 0.044106 | 8.80%
1socket | Linear+abs_N=1_iC=2048_oC=1000 | 0.036302 | 0.0344 | 5.24%
1socket | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.035734 | 0.035593 | 0.39%
| | | |
| shape | time_(ms)_before | time_(ms)_after | Gain
1thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.365143 | 0.36279 | 0.64%
1thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.364464 | 0.363392 | 0.29%
1thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.384498 | 0.379902 | 1.20%
1thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.382545 | 0.381252 | 0.34%
1thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.213244 | 0.209999 | 1.52%
1thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.212003 | 0.208567 | 1.62%
| | | |
| shape | time_(ms)_before | time_(ms)_after | Gain
4thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.126096 | 0.12157 | 3.59%
4thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.126627 | 0.121662 | 3.92%
4thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.132845 | 0.128921 | 2.95%
4thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.132642 | 0.12783 | 3.63%
4thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.079582 | 0.072584 | 8.79%
4thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.077761 | 0.071981 | 7.43%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82705
Approved by: https://github.com/frank-wei, https://github.com/eellison