hexagon: optimization for HMX mat_mul (#21554)
* hexagon: add async HMX worker
Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.
* hexagon: cost-based VTCM chunk search for out-stationary matmul
* hexagon: fix futex race in hmx_worker_drain
Store the boolean to local variable avoid atomic load twice
* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics
* hex-vmem: drop vmem limit a touch under 3GB on v73
* hexagon: add fwd declaration of htp_context
* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface
Simplifies the overall implemantion, reduces thread wakeup roundtrips.
* hex-mm: add debug log to hmx work func called from hmx-queue
* Update hmx-queue.h
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
---------
Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>