Fix Qwen3-Omni inference when mixing video and image inputs in one batch (#41741)
* Fix Qwen3-Omni inference when mixing video and image inputs in one batch
* Fix `router_aux_loss_coef`
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>