HIP: add fattn-mma-f16 for RDNA4 (#18481)
* finish VQ mma
* flash_attn_ext_f16_iter
* KQ_rowsum
* correct exp
* fix scale error
* fix softmax scale
* fix softmax scale
* enable fattn on cpu side
* fix random error
* disable fattn-mma-f16 on rdna3
* fix wrong col for rdna
* use identity mat to transpose
* resolve conflicts
* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B
* fix volta compile error
* align rdna4 policy for fattn
* adjust fattn policy
* adjust kernel selection logic
* update per review comments
* keep fattn-wmma logic
* adjust kernel selection logic
---------
Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>