Don't split 256-bit AVX2 load/store intrinsics (#20609)
Summary:
Recent versions of GCC split unaligned load and store intrinsics into
two 128-bit instructions. On old processors (Sandy Bridge) this was a
bit faster for unaligned data, but bit slower for aligned data. On new
processors (Intel Haswell+, recent AMD) splitting loads is slower on
both aligned and unaligned data.
Clang, MSVC, and ICC do not split unaligned load and store intrinsics.
There's a good explanation here:
https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top
Splitting load and store intrinsics makes no sense in our AVX2
configuration because the CPUs that support AVX2 instructions are the
same CPUs where splitting is disadvantageous on all data alignemnt.
Note that this doesn't change the AVX configuration (used by CPUs that
support AVX but not AVX2). It's possible this would be benficial for
that configuration too (our data is usually 32-byte aligned), but I'd
prefer the conservative change for now.
torch.add generated assembly (hot loop) (GCC 7.3.0)
before:
https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295
after:
https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91
Timing of `torch.add(x, y, out=z)` for size 10240 (1 thread, Broadwell,
no turbo):
before: 7.35 us after: 6.39 us
(Take the torch.add timings with a grain of salt. The difference in timings
is much larger than I would expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f