[inductor] Added affine_grid_generator decomposition (#104709)
Description:
- Added affine_grid_generator decomposition
Related to https://github.com/pytorch/pytorch/issues/104296
Fixes https://github.com/pytorch/pytorch/issues/105565
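For readers unfamiliar with the op, the following is a minimal pure-Python sketch of what 2D affine grid generation computes: a base grid of normalized coordinates in [-1, 1], transformed by each 2x3 affine matrix. It is not the decomposition added in this PR, which operates on tensors; the helper names `base_coords` and `affine_grid_2d` are hypothetical, and the `align_corners` scaling is assumed to follow the behaviour of `torch.nn.functional.affine_grid`.

```python
def base_coords(steps, align_corners):
    """Normalized coordinates along one axis, spanning [-1, 1]."""
    if steps <= 1:
        return [0.0] * steps
    coords = [-1.0 + 2.0 * i / (steps - 1) for i in range(steps)]
    if not align_corners:
        # Assumption: pixel centers are sampled instead of corners,
        # which shrinks the range by (steps - 1) / steps.
        coords = [c * (steps - 1) / steps for c in coords]
    return coords


def affine_grid_2d(theta, height, width, align_corners=False):
    """theta: list of N affine matrices, each 2x3 (nested lists).

    Returns a nested list of shape N x H x W, where every entry is an
    (x, y) tuple of transformed normalized coordinates.
    """
    xs = base_coords(width, align_corners)
    ys = base_coords(height, align_corners)
    grid = []
    for (a, b, tx), (c, d, ty) in theta:
        # Each output point is theta @ [x, y, 1].
        grid.append(
            [[(a * x + b * y + tx, c * x + d * y + ty) for x in xs] for y in ys]
        )
    return grid


identity = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
grid = affine_grid_2d(identity, height=2, width=3, align_corners=True)
print(grid[0][0])  # first row of the grid: x runs from -1 to 1, y stays at -1
```

With the identity matrix the output is just the base grid; a non-zero third column in `theta` shifts every point by that offset.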
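Performance:
- Compiled code is roughly 1.5x to 2.5x faster than Nightly on CUDA and 1.1x to 1.6x faster on CPU for bilinear and nearest modes; bicubic is roughly unchanged (0.98x to 1.04x)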
```
Speed-up PR vs Nightly = ratio of column "Compiled (2.1.0a0+git16df542) Nightly" to column "Compiled (2.1.0a0+git1afae24) PR-afgg" (values above 1 mean the PR is faster)
[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------------------]
| Eager (2.1.0a0+git1afae24) PR-afgg | Compiled (2.1.0a0+git1afae24) PR-afgg | Compiled (2.1.0a0+git16df542) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear | 7.467 (+-0.036) | 11.905 (+-0.276) | 13.391 (+-0.051) | 1.125 (+-0.000) | 7.343 (+-0.036)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear | 7.722 (+-0.168) | 14.371 (+-0.035) | 15.899 (+-0.038) | 1.106 (+-0.000) | 7.870 (+-0.043)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 7.710 (+-0.051) | 11.354 (+-0.053) | 13.376 (+-0.045) | 1.178 (+-0.000) | 7.698 (+-0.061)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear | 7.870 (+-0.050) | 13.744 (+-0.237) | 15.206 (+-0.102) | 1.106 (+-0.000) | 7.912 (+-0.039)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest | 4.738 (+-0.015) | 4.508 (+-0.005) | 6.566 (+-0.027) | 1.456 (+-0.000) | 4.630 (+-0.022)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest | 4.391 (+-0.010) | 4.860 (+-0.390) | 6.438 (+-0.047) | 1.325 (+-0.000) | 4.458 (+-0.010)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest | 4.279 (+-0.008) | 4.127 (+-0.010) | 6.598 (+-0.709) | 1.599 (+-0.000) | 5.064 (+-0.025)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest | 4.537 (+-0.010) | 4.593 (+-0.006) | 6.365 (+-0.104) | 1.386 (+-0.000) | 4.480 (+-0.011)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic | 26.411 (+-0.066) | 62.275 (+-0.436) | 64.486 (+-0.353) | 1.035 (+-0.000) | 26.210 (+-0.110)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic | 26.457 (+-0.096) | 72.887 (+-0.247) | 74.207 (+-0.337) | 1.018 (+-0.000) | 25.995 (+-0.120)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic | 26.457 (+-0.086) | 64.110 (+-0.233) | 66.340 (+-0.406) | 1.035 (+-0.000) | 26.145 (+-0.085)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic | 26.536 (+-0.094) | 73.742 (+-0.483) | 71.946 (+-0.460) | 0.976 (+-0.000) | 26.457 (+-0.166)
Times are in milliseconds (ms).
[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------------------]
| Eager (2.1.0a0+git1afae24) PR-afgg | Compiled (2.1.0a0+git1afae24) PR-afgg | Compiled (2.1.0a0+git16df542) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear | 91.971 (+-0.253) | 90.570 (+-0.193) | 137.206 (+-0.214) | 1.515 (+-0.000) | 84.280 (+-0.241)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear | 91.893 (+-0.361) | 89.866 (+-0.170) | 136.678 (+-0.471) | 1.521 (+-0.000) | 84.573 (+-0.214)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 116.967 (+-0.481) | 110.468 (+-0.326) | 223.770 (+-0.334) | 2.026 (+-0.000) | 108.098 (+-0.392)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear | 117.563 (+-0.546) | 111.438 (+-0.212) | 223.101 (+-0.350) | 2.002 (+-0.000) | 108.225 (+-0.395)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest | 80.706 (+-0.289) | 70.525 (+-0.204) | 143.697 (+-0.311) | 2.038 (+-0.000) | 74.485 (+-0.258)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest | 80.955 (+-0.208) | 69.986 (+-0.250) | 143.658 (+-0.244) | 2.053 (+-0.000) | 74.163 (+-0.238)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest | 117.576 (+-0.435) | 71.179 (+-0.412) | 178.515 (+-0.539) | 2.508 (+-0.000) | 108.394 (+-0.473)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest | 117.441 (+-0.205) | 70.313 (+-0.170) | 178.664 (+-0.555) | 2.541 (+-0.000) | 108.098 (+-0.416)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic | 92.962 (+-0.509) | 1740.964 (+-0.597) | 1785.401 (+-0.369) | 1.026 (+-0.000) | 92.638 (+-0.539)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic | 92.928 (+-0.493) | 1401.146 (+-0.732) | 1453.229 (+-0.628) | 1.037 (+-0.000) | 92.458 (+-0.428)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic | 118.152 (+-0.442) | 1740.644 (+-0.480) | 1793.475 (+-0.458) | 1.030 (+-0.000) | 107.962 (+-0.548)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic | 118.182 (+-0.425) | 1400.621 (+-0.624) | 1461.796 (+-0.630) | 1.044 (+-0.000) | 107.894 (+-0.994)
Times are in microseconds (us).
```
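As a quick check of how the speed-up column can be read, the sketch below recomputes it for three rows as the Nightly compiled time divided by the PR compiled time. The helper name `speedup` is hypothetical and the direction of the ratio is inferred from the numbers in the tables, not taken from the benchmark script.

```python
def speedup(nightly_time, pr_time):
    """How many times faster the PR build is than Nightly (>1 means faster)."""
    return nightly_time / pr_time


# (Compiled Nightly, Compiled PR-afgg) times copied from the tables above;
# cpu rows are in ms, cuda rows in us. Units cancel in the ratio.
rows = {
    "cpu, contiguous_format, align_corners=True, bilinear": (13.391, 11.905),
    "cuda, contiguous_format, align_corners=True, bilinear": (137.206, 90.570),
    "cuda, channels_last, align_corners=False, nearest": (178.664, 70.313),
}

for name, (nightly, pr) in rows.items():
    print(f"{name}: {speedup(nightly, pr):.3f}")
# cpu, contiguous_format, align_corners=True, bilinear: 1.125
# cuda, contiguous_format, align_corners=True, bilinear: 1.515
# cuda, channels_last, align_corners=False, nearest: 2.541
```

These values match the "speed-up PR vs Nightly" column for the same rows.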
[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230801-220216-affine-grid-sampler-PR-afgg-vs-Nightly-speedup.md), [script](https://github.com/vfdev-5/pth-inductor-dev/blob/master/perf_affine_grid_sampler.py)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104709
Approved by: https://github.com/lezcano