[RISCV] Further improved exact VLEN lowering for mul reductions (#192688)
This is a follow-up to 973a05ed. When we have a horizontal multiply
reduction at high LMUL and we have exact knowledge of VLEN, we can
extract the individual m1 sub-vectors and perform the entire reduction
tree at m1. This reduces the work performed (by not performing high-LMUL
operations on vectors with empty tails) and decreases register
pressure. Interestingly, we don't even increase the dynamic instruction
count, as the register alignment of the original LMUL forced the use of
whole register moves in the tree reduction anyway. (In the non-exact
case, these are vslidedown instructions, and are required.)
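
As a rough illustration of the shape of the reduction described above, here
is a small scalar model in Python. It is only a sketch, not the lowering code
itself: the function names are hypothetical, elements are treated as unsigned
values that wrap at SEW bits, and the number of m1 registers and the number
of lanes per register are both assumed to be powers of two.

```python
from math import prod


def split_into_m1(elems, vlen_bits, sew_bits):
    """Split a register group into m1-sized pieces (requires exact VLEN)."""
    per_reg = vlen_bits // sew_bits
    assert len(elems) % per_reg == 0, "group must be a whole number of registers"
    return [elems[i:i + per_reg] for i in range(0, len(elems), per_reg)]


def mul_lanes(a, b, sew_bits):
    """Lane-wise multiply of two equal-length vectors, wrapping at SEW bits."""
    mask = (1 << sew_bits) - 1
    return [(x * y) & mask for x, y in zip(a, b)]


def reduce_mul_at_m1(elems, vlen_bits, sew_bits):
    """Model of a multiply reduction whose whole tree operates on m1 values."""
    regs = split_into_m1(elems, vlen_bits, sew_bits)
    assert len(regs) & (len(regs) - 1) == 0, "assumes LMUL is a power of two"

    # Step 1: combine whole m1 registers pairwise until one remains.
    # Every multiply here works on a full m1 vector; none has an empty tail.
    while len(regs) > 1:
        half = len(regs) // 2
        regs = [mul_lanes(regs[i], regs[i + half], sew_bits) for i in range(half)]

    # Step 2: finish inside the single m1 register by repeatedly multiplying
    # the low half of the lanes with the high half.
    lanes = regs[0]
    while len(lanes) > 1:
        half = len(lanes) // 2
        lanes = mul_lanes(lanes[:half], lanes[half:], sew_bits)
    return lanes[0]


# Example: VLEN=128, SEW=32, LMUL=4, so 16 elements in 4 m1 registers.
values = list(range(1, 17))
assert reduce_mul_at_m1(values, 128, 32) == prod(values) & 0xFFFFFFFF
```

In this model the first loop corresponds to the part of the tree that would
otherwise run at high LMUL; after it, only a single m1 value is live.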
Originally written by Claude Code, heavily revised by me.