Qwen3 ASR and Forced Aligner (#43838)
* Create modular file and port processor
Create tester class and test processor initialization
* Test for pretrained, tokenizer and feature extractor
* add ProcessorTesterMixin to test class
create methods for common tests
* add config classes
* unable to pass test_apply_chat_template_audio, added debugging logic for now
* Add model and config classes
Create integration test
Setup Qwen3ASRModelTester
* Add attn_implementation to configs
Add property methods to config
Add base_model_prefix and wrapper method to generation class
* Fix tests by removing attentions hook and manually calculating attention weights
CLEANUP NEEDED
* Change model 'attentions' hook class from Qwen3ASRThinkerTextAttention to Qwen3ASRTextAttention. Qwen3ASRThinkerTextAttention is never instantiated, so 'attentions' was not being properly propagated
Fix integration tests
* Architectural change inspired by test_generate_with_static_cache: Align RoPE position handling with cache_position
Refactor position_ids construction to be fully cache_position-driven and generation-safe.
- Compute batch_size/seq_length from inputs_embeds
- Initialize cache_position when absent
- Build 3D position_ids from cache_position
- Compute rope_deltas once during prefill
- Reuse rope_deltas for subsequent decode steps
Removes legacy attention_mask-dependent branch that was incompatible with static cache generation.
Ensures correct RoPE offsets for multimodal inputs under both dynamic and static cache modes.
* Use modular transformers components to define Qwen3ASRAudioEncoderConfig
* Use modular transformers to define Qwen3ASRTextConfig from Qwen3OmniMoeTextConfig
* Comment about inherited class-level attributes for Qwen3ASRTextConfig
* Use modular transformers to define Qwen3ASRThinkerConfig from Qwen3OmniMoeThinkerConfig
* Remove comments
* Use modular transformers to define Qwen3ASRConfig from Qwen3OmniMoeConfig (could have used Qwen3Config instead)
* Import _get_feat_extract_output_lengths from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRProcessor from Qwen3OmniMoeProcessor (from_pretrained not working)
* Change pipeline_model_mapping in model tests from 'automatic-speech-recognition' to 'audio-text-to-text'
* Use modular transformers to define Qwen3ASRTextRMSNorm from Qwen3OmniMoeThinkerTextRMSNorm
* Import rotate_half, repeat_kv, apply_rotary_pos_emb, eager_attention_forward from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRTextAttention from Qwen3OmniMoeThinkerTextAttention (has to overwrite forward due to sliding_window argument in attention_interface)
* Use modular transformers to define Qwen3ASRTextMLP from Qwen3OmniMoeThinkerTextMLP
* Use modular transformers to define Qwen3ASRThinkerTextDecoderLayer from Qwen3OmniMoeThinkerTextDecoderLayer
* Import _get_feat_extract_output_lengths from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRPreTrainedModelForConditionalGeneration from Qwen3OmniMoePreTrainedModelForConditionalGeneration
* Use modular transformers to define Qwen3ASRAudioAttention from Qwen3OmniMoeAudioAttention
* Use modular transformers to define Qwen3ASRAudioEncoderLayer from Qwen3OmniMoeAudioEncoderLayer
* Import SinusoidsPositionEmbedding from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRAudioEncoder from Qwen3OmniMoeAudioEncoder
* Use modular transformers to define Qwen3ASRThinkerTextRotaryEmbedding from Qwen3OmniMoeThinkerTextRotaryEmbedding
Chose to keep compute_default_rope_parameters despite it not originally being in Qwen3ASR
* Use modular transformers to define Qwen3ASRThinkerTextMLP directly from Qwen3OmniMoeThinkerTextMLP
* Use modular transformers to define Qwen3ASRThinkerTextRMSNorm directly from Qwen3OmniMoeThinkerTextRMSNorm
* Use modular transformers to define Qwen3ASRThinkerTextModel from Qwen3OmniMoeThinkerTextModel
* Use modular transformers to define Qwen3ASRThinkerForConditionalGeneration from Qwen3OmniMoeThinkerForConditionalGeneration
Chose not to inherit get_audio_features because the outputs are of a different type and the modular converter does not support unravelling 'audio_outputs = super().get_audio_features()'
* Update Qwen3ASRTextConfig modular according to convention.
* Nits
* Change Qwen3ASRProcessor inheritance from Qwen3OmniMoeProcessor to AudioFlamingo3Processor - init no longer has to be overwritten
* Comment about ThinkerConfig inheritance
* Change Qwen3ASRProcessor to inherit directly - init no longer has to be overwritten
* Remove torch.manual_seed from integration tests
* Style: fix ruff lint issues and typing compliance
* Add reproducer to programmatically update expected results for integration tests, link to external gist in comments
* Add convert_qwen3_asr_to_hf.py
* Remove Qwen3OmniMoeConfig inheritance from Qwen3ASRConfig
* Remove Qwen3OmniMoeThinkerConfig inheritance from Qwen3ASRThinkerConfig
* cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Functional model conversion.
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Add init_weights to Qwen3ASRPreTrainedModel to pass ModelTesterMixin::test_init_weights_can_init_buffers
* Cleanup
* Cleanup
* Cleanup
* Use converted hf weights for integration tests
* Change Processor tests to use hf checkpoint
* Restore CI/github scripts to upstream versions
* Restore CI/github scripts to upstream versions (2)
* Restore CI/github scripts to upstream versions (3)
* passing integration tests
* Standardize processor.
* Cleanup and standardize modeling.
* Remove rope deltas.
* Stop tracking reproducer.
* Update config modular.
* Account for n_window in encoder length computation.
* Add qwen3asr
* Nit
* Expose encoder from qwen3 omni, and cleaner modular.
* Directly use language model from Qwen3.
* Modular from other audio LMs.
* Shift flattening to processor.
* Add docs and post-process methods.
* Address model integration tests + style
* Processing tests.
* Functional forced alignment in a single modular.
* Add reproducer for timestamps.
* Remove processor from modular.
* Create base Qwen3ASR model like Llava.
* Push timestamp fixtures.
* Nits and style.
* Forced aligner refactor: new auto class and better naming.
* Forced alignment nits.
* Create audio encoder that is more in line with other models and torch.compile compatible!
* Small fixes for tests.
* add torch compile forced aligner example, and small fix for compile
* Modeling nits.
* undo exposure of omni audio encoder, doc/style nits
* Add note on attention's k_proj bias.
* Cleaner init.
* Apply suggestion from @vasqu
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Apply suggestion from @vasqu
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Doc improvements, and conversion fix.
* Simplify conversion script.
* Apply suggestion from @vasqu
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Apply suggestion from @vasqu
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Better encoder config in modular.
* Add default method to SinusoidsPositionEmbedding, and generate from modular.
* Refactor forced aligner. Use GenericForTokenClassification.
* Address processor comments.
* Add support for language codes.
* Address comments for token classification.
* Better modular for attention and token classification.
* Modular after merge.
* Use new ALM testing classes.
* Update src/transformers/models/qwen3_asr/feature_extraction_qwen3_asr.py
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Address review comments: create make_list_of_audio_chat_template util, improve qwen3 asr modular.
* Modular after merge.
* Address unprotected torch import.
* Introduce score_bias for GenericForTokenClassification.
* Refactor token classification bias.
* Refactor processing like AudioFlamingo3 with submethods.
* Use windowed attention like in Qwen 3 Omni.
* Add multimodal projector, and small refactor.
* Better max_source_positions, style fixes.
* Update modular after ALM refactor.
* check repo
* Apply post-processing like original implementation.
* Set default max new tokens like original, and nits.
* Zero pad to min length like original
* Remove padding mask update for min length (like original)
* Refactor, and update padding mask.
* revert mask update, hurts AMI performance
* feature extractor nits
* Renaming with hf suffix.
* address comments
* Use common util for floats_list
* Prepare for new checkpoints.
* Processor tests, loading fix.
* Rename file according to others.
* shorter file for tests
* leaner processor tests
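
Note on the "Align RoPE position handling with cache_position" entry above: the five listed steps can be sketched roughly as follows. This is a framework-free illustration using plain lists, not the code in this PR. `build_position_ids` and all of its parameters are hypothetical names, and positions are assumed to be identical across the three RoPE axes. A later entry removes rope deltas, so this only reflects the intermediate design.

```python
from typing import List, Optional, Tuple

# Nested lists standing in for a (3, batch, seq) integer tensor.
PositionIds = List[List[List[int]]]


def build_position_ids(
    inputs_embeds_shape: Tuple[int, int, int],
    cache_position: Optional[List[int]] = None,
    past_seen_tokens: int = 0,
    rope_deltas: Optional[List[int]] = None,
) -> Tuple[PositionIds, List[int]]:
    """Hypothetical helper: derive 3D position ids from cache_position."""
    # Step 1: batch size and sequence length come from inputs_embeds.
    batch_size, seq_length, _hidden = inputs_embeds_shape

    # Step 2: initialize cache_position when the caller did not provide it.
    if cache_position is None:
        cache_position = list(range(past_seen_tokens, past_seen_tokens + seq_length))

    is_prefill = cache_position[0] == 0
    if is_prefill or rope_deltas is None:
        # Step 3: broadcast cache_position to (3, batch, seq).
        position_ids = [
            [list(cache_position) for _ in range(batch_size)] for _ in range(3)
        ]
        # Step 4: compute the per-sample offset once, during prefill.
        # With this simplified layout the offset is always 0; a multimodal
        # layout could yield a non-zero value here (assumption).
        rope_deltas = [
            max(max(axis[b]) for axis in position_ids) + 1 - seq_length
            for b in range(batch_size)
        ]
    else:
        # Step 5: decode steps reuse the stored offset instead of recomputing.
        position_ids = [
            [[p + rope_deltas[b] for p in cache_position] for b in range(batch_size)]
            for _ in range(3)
        ]
    return position_ids, rope_deltas
```

Because nothing here reads attention_mask, the same path serves both dynamic and static caches: the caller only has to pass the rope_deltas returned by the prefill call into every later decode call.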
---------
Co-authored-by: mbtariq82 <mbtariq82@gmail.com>
Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eric Bezzam <4757445+ebezzam@users.noreply.github.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>