[msan] Fix handling of 256-bit hadd/hsub instructions (#168121)
These horizontal add/sub instructions are currently handled by
adding/subtracting adjacent pairs of elements of the first operand,
followed by adjacent pairs of the second operand. This is not the correct
semantics for the 256-bit instructions: they process the first half of
the first operand, then the first half of the second operand, then the
second half of the first operand, and finally the second half of the
second operand (see [*]).
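To make the ordering concrete, here is a small scalar reference model in C contrasting the two orderings for 32-bit elements. The function names `hadd256_epi32_ref` and `hadd256_epi32_naive` are hypothetical and exist only for illustration. The model assumes the `_mm256_hadd_epi32` semantics shown in the [*] test (each 128-bit half processed independently, with the `a` pairs before the `b` pairs within that half). It is not code from MSan or clang.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of 256-bit hadd (8 x i32):
   each 128-bit half is processed independently, and within a half
   the pair sums from `a` come before the pair sums from `b`. */
static void hadd256_epi32_ref(const int32_t a[8], const int32_t b[8],
                              int32_t out[8]) {
  for (int half = 0; half < 2; ++half) {
    const int base = half * 4; /* first element of this 128-bit half */
    out[base + 0] = a[base + 0] + a[base + 1];
    out[base + 1] = a[base + 2] + a[base + 3];
    out[base + 2] = b[base + 0] + b[base + 1];
    out[base + 3] = b[base + 2] + b[base + 3];
  }
}

/* Hypothetical model of the old (incorrect) ordering:
   all pair sums of `a` first, then all pair sums of `b`. */
static void hadd256_epi32_naive(const int32_t a[8], const int32_t b[8],
                                int32_t out[8]) {
  for (int i = 0; i < 4; ++i) {
    out[i] = a[2 * i] + a[2 * i + 1];
    out[4 + i] = b[2 * i] + b[2 * i + 1];
  }
}
```

With the inputs from the [*] test, the reference model yields `30,70,20,60,110,150,100,140`, while the naive ordering yields `30,70,110,150,20,60,100,140`.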
This patch fixes the issue by using the "shards" functionality added in
https://github.com/llvm/llvm-project/pull/167954 to handle the bottom and
top 128-bit "shards" in turn.
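The effect of sharding on shadow propagation can be modeled in scalar C. The model makes three assumptions. First, each output element's shadow is the bitwise OR of the shadows of the two input elements that feed it. Second, the 128-bit rule is reused unchanged on each half. Third, the helper names are hypothetical. This is not MSan's actual implementation, which operates on LLVM IR vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shadow rule for one 128-bit hadd/hsub (4 x i32):
   an output is poisoned wherever either of its two inputs is
   (assumed OR-approximation of the add/sub). */
static void hadd128_shadow(const uint32_t sa[4], const uint32_t sb[4],
                           uint32_t out[4]) {
  out[0] = sa[0] | sa[1];
  out[1] = sa[2] | sa[3];
  out[2] = sb[0] | sb[1];
  out[3] = sb[2] | sb[3];
}

/* Sharded 256-bit handling: apply the 128-bit rule to the bottom
   half of both operands, then to the top half of both operands. */
static void hadd256_shadow(const uint32_t sa[8], const uint32_t sb[8],
                           uint32_t out[8]) {
  for (int shard = 0; shard < 2; ++shard) {
    const int base = shard * 4; /* first element of this 128-bit shard */
    hadd128_shadow(sa + base, sb + base, out + base);
  }
}
```

Under this model, poisoning only `a[4]` marks output element 4 as uninitialized. The old, unsharded ordering would instead have marked output element 2.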
[*] clang/test/CodeGen/X86/avx2-builtins.c:
```
TEST_CONSTEXPR(match_v8si(_mm256_hadd_epi32(
(__m256i)(__v8si){10, 20, 30, 40, 50, 60, 70, 80},
(__m256i)(__v8si){5, 15, 25, 35, 45, 55, 65, 75}),
30,70,20,60,110,150,100,140));
```