Added vectorized horizontal flip path for channels last NCHW input (#91806)
## Description
- Added AVX2-only vectorization for the horizontal flip op applied to channels-last NCHW input, where **2 <= C * sizeof(dtype) <= 16**. This PR is slightly faster than Pillow and substantially faster (x2 - x5) than nightly.
- ~Still keeping the `cpu_vflip_memcpy` code ([its PR](https://github.com/pytorch/pytorch/pull/89414) was reverted and is under investigation)~
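
The per-row operation that the AVX2 path vectorizes can be sketched in pure Python (a hypothetical scalar reference, not the actual kernel): in channels-last memory each pixel's C channel values are contiguous, so a horizontal flip copies C-element pixel blocks from the input row in reverse pixel order while leaving the channels within each block untouched.

```python
def hflip_channels_last_row(data, width, channels):
    """Horizontally flip one image row stored channels-last.

    `data` is a flat list of length width * channels; pixel j occupies
    data[j*channels : (j+1)*channels]. The flip reverses the pixel order
    but keeps each pixel's channels contiguous -- this small-block copy
    is what the AVX2 kernel handles for 2 <= C * sizeof(dtype) <= 16.
    """
    out = [0] * len(data)
    for j in range(width):
        src = (width - 1 - j) * channels
        out[j * channels : (j + 1) * channels] = data[src : src + channels]
    return out

# A 3-pixel RGB row: [R0 G0 B0, R1 G1 B1, R2 G2 B2]
row = [10, 11, 12, 20, 21, 22, 30, 31, 32]
print(hflip_channels_last_row(row, width=3, channels=3))
# -> [30, 31, 32, 20, 21, 22, 10, 11, 12]
```

Because the block size C * sizeof(dtype) fits in 16 bytes, the whole per-pixel copy can be done with a single (possibly masked) 128-bit load/store instead of a per-channel scalar loop, which is where the speedup over the nightly scalar path comes from.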
## Benchmarks
```
[---------------------------------------------------------------------- Horizontal flip ----------------------------------------------------------------------]
| torch (2.0.0a0+gitf6d73f3) PR | Pillow (9.4.0) | torch (2.0.0a0+git4386f31) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------
channels=2, size=256, dtype=torch.uint8, mf=channels_last | 31.859 (+-0.498) | | 190.599 (+-7.579)
channels=2, size=520, dtype=torch.uint8, mf=channels_last | 60.648 (+-0.074) | | 706.895 (+-11.219)
channels=2, size=712, dtype=torch.uint8, mf=channels_last | 95.994 (+-2.510) | | 1340.685 (+-169.279)
channels=3, size=256, dtype=torch.uint8, mf=channels_last | 45.490 (+-0.108) | 47.359 (+-0.942) | 179.520 (+-2.916)
channels=3, size=520, dtype=torch.uint8, mf=channels_last | 146.802 (+-2.175) | 174.201 (+-4.124) | 707.765 (+-2.691)
channels=3, size=712, dtype=torch.uint8, mf=channels_last | 215.148 (+-0.925) | 313.606 (+-3.972) | 1346.678 (+-89.854)
channels=3, size=256, dtype=torch.int8, mf=channels_last | 43.618 (+-0.160) | | 191.613 (+-16.252)
channels=3, size=520, dtype=torch.int8, mf=channels_last | 147.487 (+-0.691) | | 755.020 (+-25.045)
channels=3, size=712, dtype=torch.int8, mf=channels_last | 216.687 (+-0.906) | | 1314.854 (+-31.137)
channels=4, size=256, dtype=torch.uint8, mf=channels_last | 32.169 (+-0.092) | | 195.415 (+-3.647)
channels=4, size=520, dtype=torch.uint8, mf=channels_last | 89.465 (+-0.154) | | 776.459 (+-14.845)
channels=4, size=712, dtype=torch.uint8, mf=channels_last | 152.773 (+-0.610) | | 1456.304 (+-45.280)
channels=8, size=256, dtype=torch.uint8, mf=channels_last | 43.444 (+-0.158) | | 163.669 (+-4.580)
channels=8, size=520, dtype=torch.uint8, mf=channels_last | 151.285 (+-0.602) | | 642.396 (+-13.500)
channels=8, size=712, dtype=torch.uint8, mf=channels_last | 278.471 (+-0.912) | | 1205.472 (+-47.609)
channels=16, size=256, dtype=torch.uint8, mf=channels_last | 75.176 (+-0.188) | | 181.278 (+-3.388)
channels=16, size=520, dtype=torch.uint8, mf=channels_last | 291.105 (+-1.163) | | 716.906 (+-30.842)
channels=16, size=712, dtype=torch.uint8, mf=channels_last | 893.267 (+-10.899) | | 1434.931 (+-40.399)
channels=2, size=256, dtype=torch.int16, mf=channels_last | 31.437 (+-0.143) | | 195.299 (+-2.916)
channels=2, size=520, dtype=torch.int16, mf=channels_last | 89.834 (+-0.175) | | 774.940 (+-8.638)
channels=2, size=712, dtype=torch.int16, mf=channels_last | 154.806 (+-0.550) | | 1443.435 (+-37.799)
channels=3, size=256, dtype=torch.int16, mf=channels_last | 70.909 (+-0.146) | | 195.347 (+-1.986)
channels=3, size=520, dtype=torch.int16, mf=channels_last | 212.998 (+-1.181) | | 776.282 (+-15.598)
channels=3, size=712, dtype=torch.int16, mf=channels_last | 382.991 (+-0.968) | | 1441.674 (+-9.873)
channels=4, size=256, dtype=torch.int16, mf=channels_last | 43.574 (+-0.157) | | 163.176 (+-1.941)
channels=4, size=520, dtype=torch.int16, mf=channels_last | 151.289 (+-0.557) | | 641.169 (+-9.457)
channels=4, size=712, dtype=torch.int16, mf=channels_last | 275.275 (+-0.874) | | 1186.589 (+-12.063)
channels=8, size=256, dtype=torch.int16, mf=channels_last | 74.455 (+-0.292) | | 181.191 (+-1.721)
channels=8, size=520, dtype=torch.int16, mf=channels_last | 289.591 (+-1.134) | | 715.755 (+-2.368)
channels=8, size=712, dtype=torch.int16, mf=channels_last | 923.831 (+-68.807) | | 1437.078 (+-14.649)
channels=2, size=256, dtype=torch.int32, mf=channels_last | 44.217 (+-0.203) | | 163.011 (+-1.497)
channels=2, size=520, dtype=torch.int32, mf=channels_last | 150.920 (+-0.950) | | 640.761 (+-1.882)
channels=2, size=712, dtype=torch.int32, mf=channels_last | 281.648 (+-1.163) | | 1188.464 (+-10.374)
channels=3, size=256, dtype=torch.int32, mf=channels_last | 103.708 (+-0.517) | | 165.001 (+-1.315)
channels=3, size=520, dtype=torch.int32, mf=channels_last | 409.785 (+-8.004) | | 647.939 (+-11.431)
channels=3, size=712, dtype=torch.int32, mf=channels_last | 790.819 (+-16.471) | | 1219.206 (+-9.503)
channels=4, size=256, dtype=torch.int32, mf=channels_last | 72.975 (+-0.155) | | 181.298 (+-1.059)
channels=4, size=520, dtype=torch.int32, mf=channels_last | 291.584 (+-0.905) | | 716.033 (+-4.824)
channels=4, size=712, dtype=torch.int32, mf=channels_last | 938.790 (+-15.930) | | 1434.134 (+-15.060)
Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/8e8c989d35835d7ab20567bff36632be#file-20230123-143303-pr_vs_nightly-md)
## Context
Follow-up work to PRs: https://github.com/pytorch/pytorch/pull/88989, https://github.com/pytorch/pytorch/pull/89414 and https://github.com/pytorch/pytorch/pull/90013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91806
Approved by: https://github.com/peterbell10, https://github.com/lezcano