use 4 warps for small block config in mm (#95339)
Temporary fix for #95312.
In Triton, one warp computes a 16x16 tile of the output, so a 32x32 block needs only (32/16) * (32/16) = 4 warps. Using 8 warps triggers an illegal memory access (IMA), which is a bug, but 8 warps is not a good config for this block size anyway.
Triton main is supposed to behave better for these pathological configs, but we are not on main yet.
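
The warp-count arithmetic above can be sketched as follows. This is only an illustration, not the code changed in this PR: `warps_needed` and `clamp_num_warps` are hypothetical helper names, and the 16x16 per-warp tile is taken from the description above as an assumption.

```python
import math

# Assumption (from the description above): one warp covers a 16x16 output tile.
WARP_TILE_M = 16
WARP_TILE_N = 16


def warps_needed(block_m: int, block_n: int) -> int:
    """Hypothetical helper: count the 16x16 warp tiles needed to cover a block."""
    tiles_m = math.ceil(block_m / WARP_TILE_M)
    tiles_n = math.ceil(block_n / WARP_TILE_N)
    return max(1, tiles_m * tiles_n)


def clamp_num_warps(block_m: int, block_n: int, requested: int) -> int:
    """Hypothetical helper: never request more warps than the block has tiles for."""
    return min(requested, warps_needed(block_m, block_n))


if __name__ == "__main__":
    # A 32x32 block has 2 * 2 = 4 warp tiles, so a request for 8 warps is cut to 4.
    print(clamp_num_warps(32, 32, requested=8))
```

Under this assumption, larger blocks such as 64x64 have 16 warp tiles, so a request for 4 or 8 warps passes through unchanged.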
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95339
Approved by: https://github.com/ezyang, https://github.com/Chillee