hexagon: minor refresh for HMX FA and MM (llama/23796)
* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity
* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs
This is a bit faster than LUT.
* hex-cmake: one more tweak for lto
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>