[AMDGPU] Optimize fsub and fneg when packed fp32 ops are supported (#195962)
Use v_pk_add_f32 to implement fsub v2f32 by negating the second operand
with a source modifier. In addition, split fneg on wider vectors into
v2f32 pieces so that each piece can be folded into the source modifier
of fadd v2f32.
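
For illustration only, here is a small host-side C++ model of the identity this relies on. It is not the LLVM implementation: `V2F32`, `pk_add_f32`, `fsub_v2f32`, and `fadd_fneg_v4f32` are hypothetical names, and the single `neg_b` flag is a simplified stand-in for the hardware's negate source modifier on the second operand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of a v2f32 value; not an LLVM or hardware type.
using V2F32 = std::array<float, 2>;
using V4F32 = std::array<float, 4>;

// Models a packed fp32 add whose second operand can be negated for free.
// Assumption: neg_b stands in for the negate source modifier.
V2F32 pk_add_f32(const V2F32 &a, const V2F32 &b, bool neg_b) {
  V2F32 r{};
  for (std::size_t i = 0; i < 2; ++i)
    r[i] = a[i] + (neg_b ? -b[i] : b[i]);
  return r;
}

// fsub v2f32 a, b expressed as a single packed add with b negated.
V2F32 fsub_v2f32(const V2F32 &a, const V2F32 &b) {
  return pk_add_f32(a, b, /*neg_b=*/true);
}

// fadd v4f32 a, (fneg b): the wide fneg is split into two v2f32 halves,
// so each half becomes a modifier on one packed add.
V4F32 fadd_fneg_v4f32(const V4F32 &a, const V4F32 &b) {
  V4F32 r{};
  for (std::size_t half = 0; half < 2; ++half) {
    const std::size_t base = 2 * half;
    const V2F32 a2{a[base], a[base + 1]};
    const V2F32 b2{b[base], b[base + 1]};
    const V2F32 sum = pk_add_f32(a2, b2, /*neg_b=*/true);
    r[base] = sum[0];
    r[base + 1] = sum[1];
  }
  return r;
}

// Checks the model against plain elementwise subtraction.
void self_check() {
  const V2F32 a{5.0f, 1.5f}, b{2.0f, 4.0f};
  const V2F32 d = fsub_v2f32(a, b);
  assert(d[0] == a[0] - b[0] && d[1] == a[1] - b[1]);
}
```

In this model no separate negate step is executed in either function; the negation rides along with the packed add, which is the effect the split into v2f32 pieces is meant to enable.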