[X86] Fold add(psadbw(X,0),psadbw(Y,0)) -> psadbw(add(X,Y),0)
If the vXi8 add(X,Y) is guaranteed not to overflow, then we can push the addition through the psadbw nodes (which are being used for reduction) and only need a single psadbw node.
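
As a rough illustration of why the fold is sound, here is a scalar model of a single 64-bit psadbw lane. This is only a sketch, not the DAG combine itself: the names Lane, psadbwZero and addBytes are hypothetical and do not appear in the patch. It checks that the two forms agree when no byte sum overflows, and that they differ once a byte wraps.

```cpp
#include <cstdint>

// One 64-bit PSADBW lane: eight unsigned bytes (hypothetical helper type).
struct Lane { std::uint8_t b[8]; };

// Scalar model of psadbw(V, 0) for one lane: the sum of |V[i] - 0|,
// i.e. the horizontal sum of the eight bytes.
constexpr std::uint64_t psadbwZero(const Lane &V) {
  std::uint64_t Sum = 0;
  for (int I = 0; I != 8; ++I)
    Sum += V.b[I];
  return Sum;
}

// Scalar model of a vXi8 add: each byte wraps modulo 256.
constexpr Lane addBytes(const Lane &X, const Lane &Y) {
  Lane R = {};
  for (int I = 0; I != 8; ++I)
    R.b[I] = static_cast<std::uint8_t>(X.b[I] + Y.b[I]);
  return R;
}

// No byte sum exceeds 255, so both forms give the same result.
constexpr Lane X = {{1, 2, 3, 4, 5, 6, 7, 8}};
constexpr Lane Y = {{8, 7, 6, 5, 4, 3, 2, 1}};
static_assert(psadbwZero(X) + psadbwZero(Y) == psadbwZero(addBytes(X, Y)),
              "fold holds when the byte add cannot overflow");

// 200 + 100 wraps to 44 in a byte, so the fold would be wrong here.
constexpr Lane A = {{200, 0, 0, 0, 0, 0, 0, 0}};
constexpr Lane B = {{100, 0, 0, 0, 0, 0, 0, 0}};
static_assert(psadbwZero(A) + psadbwZero(B) != psadbwZero(addBytes(A, B)),
              "fold breaks when a byte add overflows");
```

The second check is the reason for the no-overflow requirement: the separate psadbw results are summed at full width, while the combined form sums the bytes only after they have wrapped.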
Noticed while working on CTPOP reduction codegen.