opencl: flash attention improvement (#25069)
* opencl: rework FA kernel for f16 and f32
* opencl: flash-attention prefill prepass kernels
- flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
- flash_attn_mask_pad_f16 pads the matching mask tile
- flash_attn_blk_f16 classifies each KV tile per query block as
  fully masked, mixed, or fully unmasked, so the main kernel can
  skip fully-masked tiles and skip the mask lookup for
  fully-unmasked ones
* opencl: FA kernels for q4_0 and q8_0
* opencl: `set_rows` for f32 to q8_0/q4_0
* opencl: dequant kernels for q4_0 and q8_0
* opencl: add FA tile tuning table with override
* opencl: wire host side for FA
* opencl: q4_0 MoE tensors are also stored in SOA layout
* opencl: cosmetic fix
* opencl: refactor, also clarify some code paths in comments
* opencl: fix infinity handling for `-cl-finite-math-only`
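
For reviewers, the tile classification done by flash_attn_blk_f16 can be
illustrated with a minimal host-side sketch in plain C. This is not the
kernel code. It assumes an additive f32 mask stored row-major in which
-INFINITY means "masked" and 0 means "unmasked"; the names `tile_class`
and `classify_tile` and the enum values are hypothetical, and the real
kernel operates on f16 data and may encode the classes differently.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical class encoding, for illustration only. */
enum tile_class {
    TILE_FULLY_MASKED   = 0, /* every entry is -INFINITY: tile can be skipped   */
    TILE_MIXED          = 1, /* anything else: the mask must be read per entry  */
    TILE_FULLY_UNMASKED = 2  /* every entry is 0: the mask lookup can be skipped */
};

/*
 * Classify the mask entries for query rows [q0, q1) and KV columns [k0, k1).
 * `mask` is row-major with `n_kv` columns per query row (an assumption of
 * this sketch). The tile must be non-empty.
 */
static enum tile_class classify_tile(const float *mask, size_t n_kv,
                                     size_t q0, size_t q1,
                                     size_t k0, size_t k1)
{
    assert(q0 < q1 && k0 < k1);

    int all_masked = 1; /* stays 1 only if every entry is -INFINITY */
    int all_zero   = 1; /* stays 1 only if every entry is exactly 0 */

    for (size_t q = q0; q < q1; ++q) {
        for (size_t k = k0; k < k1; ++k) {
            const float m = mask[q * n_kv + k];
            if (m != -INFINITY) all_masked = 0;
            if (m != 0.0f)      all_zero   = 0;
            if (!all_masked && !all_zero) {
                return TILE_MIXED; /* early exit: neither uniform case holds */
            }
        }
    }
    return all_masked ? TILE_FULLY_MASKED : TILE_FULLY_UNMASKED;
}
```

With a 2x4 mask whose first two columns are 0 and last two are -INFINITY,
columns [0, 2) classify as fully unmasked, columns [2, 4) as fully masked,
and columns [1, 3) as mixed. Any finite non-zero bias also lands in the
mixed class, since the main kernel would still need to read it.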
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>