[WebAssembly] Lower wide SIMD i8 muls (#130785)
Currently, 'wide' i32 simd multiplication, with extended i8 elements,
will perform the multiplication with i32 So, for IR like the following:
```
%wide.a = sext <8 x i8> %a to <8 x i32>
%wide.b = sext <8 x i8> %a to <8 x i32>
%mul = mul <8 x i32> %wide.a, %wide.b
ret <8 x i32> %mul
```
We would generate the following sequence:
```
i16x8.extend_low_i8x16_s $push6=, $1
local.tee $push5=, $3=, $pop6
i32x4.extmul_low_i16x8_s $push0=, $pop5, $3
v128.store 0($0), $pop0
i8x16.shuffle $push1=, $1, $1, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
i16x8.extend_low_i8x16_s $push4=, $pop1
local.tee $push3=, $1=, $pop4
i32x4.extmul_low_i16x8_s $push2=, $pop3, $1
v128.store 16($0), $pop2
return
```
But now we perform the multiplication with i16, resulting in:
```
i16x8.extmul_low_i8x16_s $push3=, $1, $1
local.tee $push2=, $1=, $pop3
i32x4.extend_high_i16x8_s $push0=, $pop2
v128.store 16($0), $pop0
i32x4.extend_low_i16x8_s $push1=, $1
v128.store 0($0), $pop1
return
```