Qwen3 ASR and Forced Aligner (#43838)

Commit

7 days ago

Qwen3 ASR and Forced Aligner (#43838)

* Create modular file and port processor
  Create tester class and test processor initialization
* Test for pretrained, tokenizer and feature extractor
* add ProcessorTesterMixin to test class
  create methods for common tests
* add config classes
* unable to pass test_apply_chat_template_audio, added debugging logic for now
* Add model and config classes
  Create integration test
  Setup Qwen3ASRModelTester
* Add attn_implementation to configs
  Add property methods to config
  Add base_model_prefix and wrapper method to generation class
* Fix tests by removing attentions hook and manually calculating attention weights
  CLEANUP NEEDED
* Change model 'attentions' hook class from Qwen3ASRThinkerTextAttention to Qwen3ASRTextAttention, Qwen3ASRThinkerTextAttention is never instantiated and so 'attentions' was not being properly propagated
  Fix integration tests
* Architectural change inspired by test_generate_with_static_cache: Align RoPE position handling with cache_position
  Refactor position_ids construction to be fully cache_position-driven and generation-safe.
  - Compute batch_size/seq_length from inputs_embeds
  - Initialize cache_position when absent
  - Build 3D position_ids from cache_position
  - Compute rope_deltas once during prefill
  - Reuse rope_deltas for subsequent decode steps
  Removes legacy attention_mask-dependent branch that was incompatible with static cache generation.
  Ensures correct RoPE offsets for multimodal inputs under both dynamic and static cache modes.
* Use modular transformers components to define Qwen3ASRAudioEncoderConfig
* Use modular transformers to define Qwen3ASRTextConfig from Qwen3OmniMoeTextConfig
* Comment about inherited class-level attributes for Qwen3ASRTextConfig
* Use modular transformers to define Qwen3ASRThinkerConfig from Qwen3OmniMoeThinkerConfig
* Remove comments
* Use modular transformers to define Qwen3ASRConfig from Qwen3OmniMoeConfig (could have used Qwen3Config instead)
* Import _get_feat_extract_output_lengths from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRProcessor from Qwen3OmniMoeProcessor (from_pretrained not working)
* Change pipeline_model_mapping in model tests from 'automatic-speech-recognition' to 'audio-text-to-text'
* Use modular transformers to define Qwen3ASRTextRMSNorm from Qwen3OmniMoeThinkerTextRMSNorm
* Import rotate_half, repeat_kv, apply_rotary_pos_emb, eager_attention_forward from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRTextAttention from Qwen3OmniMoeThinkerTextAttention (has to overwrite forward due to sliding_window argument in attention_interface)
* Use modular transformers to define Qwen3ASRTextMLP from Qwen3OmniMoeThinkerTextMLP
* Use modular transformers to define Qwen3ASRThinkerTextDecoderLayer from Qwen3OmniMoeThinkerTextDecoderLayer
* Import _get_feat_extract_output_lengths from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRPreTrainedModelForConditionalGeneration from Qwen3OmniMoePreTrainedModelForConditionalGeneration
* Use modular transformers to define Qwen3ASRAudioAttention from Qwen3OmniMoeAudioAttention
* Use modular transformers to define Qwen3ASRAudioEncoderLayer from Qwen3OmniMoeAudioEncoderLayer
* Import SinusoidsPositionEmbedding from Qwen3-Omni-Moe instead of redefining
* Use modular transformers to define Qwen3ASRAudioEncoder from Qwen3OmniMoeAudioEncoder
* Use modular transformers to define Qwen3ASRThinkerTextRotaryEmbedding from Qwen3OmniMoeThinkerTextRotaryEmbedding
  Chose to keep compute_default_rope_parameters despite it not originally being in Qwen3ASR
* Use modular transformers to define Qwen3ASRThinkerTextMLP directly from Qwen3OmniMoeThinkerTextMLP
* Use modular transformers to define Qwen3ASRThinkerTextRMSNorm directly from Qwen3OmniMoeThinkerTextRMSNorm
* Use modular transformers to define Qwen3ASRThinkerTextModel from Qwen3OmniMoeThinkerTextModel
* Use modular transformers to define Qwen3ASRThinkerForConditionalGeneration from Qwen3OmniMoeThinkerForConditionalGeneration
  Chose not to inherit get_audio_features because the outputs are of different type and the modular converter does not support unravelling 'audio_outputs = super().get_audio_features()'
* Update Qwen3ASRTextConfig modular according to convention.
* Nits
* Change Qwen3ASRProcessor inheritance from Qwen3OmniMoeProcessor to AudioFlamingo3Processor - init no longer has to be overwritten
* Comment about ThinkerConfig inheritance
* Change Qwen3ASRProcessor to inherit directly - init no longer has to be overwritten
* Remove torch.manual_seed from integration tests
* Style: fix ruff lint issues and typing compliance
* Add reproducer to programmatically update expected results for integration tests, link to external gist in comments
* Add convert_qwen3_asr_to_hf.py
* Remove Qwen3OmniMoeConfig inheritance from Qwen3ASRConfig
* Remove Qwen3OmniMoeThinkerConfig inheritance from Qwen3ASRThinkerConfig
* cleanup
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Functional model conversion.
* Cleanup
* Cleanup
* Cleanup
* Cleanup
* Add init_weights to Qwen3ASRPreTrainedModel to pass ModelTesterMixin::test_init_weights_can_init_buffers
* Cleanup
* Cleanup
* Cleanup
* Use converted hf weights for integration tests
* Change Processor tests to use hf checkpoint
* Restore CI/github scripts to upstream versions
* Restore CI/github scripts to upstream versions (2)
* Restore CI/github scripts to upstream versions (3)
* passing integration tests
* Standardize processor.
* Cleanup and standardize modeling.
* Remove rope deltas.
* Stop tracking reproducer.
* Update config modular.
* Account for n_window in encoder length computation.
* Add qwen3asr
* Nit
* Expose encoder from qwen3 omni, and cleaner modular.
* Directly use language model from Qwen3.
* Modular from other audio LMs.
* Shift flattening to processor.
* Add docs and post-process methods.
* Address model integration tests + style
* Processing tests.
* Functional forced alignment in a single modular.
* Add reproducer for timestamps.
* Remove processor from modular.
* Create base Qwen3ASR model like Llava.
* Push timestamp fixtures.
* Nits and style.
* Forced aligner refactor: new auto class and better naming.
* Forced alignment nits.
* Create audio encoder that is more in line with others and torch compile compatible!
* Small fixes for tests.
* add torch compile forced aligner example, and small fix for compile
* Modeling nits.
* undo exposure of omni audio encoder, doc/style nits
* Add note on attention's k_proj bias.
* Cleaner init.
* Apply suggestion from @vasqu
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Apply suggestion from @vasqu
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Doc improvements, and conversion fix.
* Simplify conversion script.
* Apply suggestion from @vasqu
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Apply suggestion from @vasqu
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Better encoder config in modular.
* Add default method to SinusoidsPositionEmbedding, and generate from modular.
* Refactor forced aligner. Use GenericForTokenClassification.
* Address processor comments.
* Add support for language codes.
* Address comments for token classification.
* Better modular for attention and token classification.
* Modular after merge.
* Use new ALM testing classes.
* Update src/transformers/models/qwen3_asr/feature_extraction_qwen3_asr.py
  Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Address review comments: create make_list_of_audio_chat_template util, improve qwen3 asr modular.
* Modular after merge.
* Address unprotected torch import.
* Introduce score_bias for GenericForTokenClassification.
* Refactor token classification bias.
* Refactor processing like AudioFlamingo3 with submethods.
* Use windowed attention like in Qwen 3 Omni.
* Add multimodal projector, and small refactor.
* Better max_source_positions, style fixes.
* Update modular after ALM refactor.
* check repo
* Apply post-processing like original implementation.
* Set default max new tokens like original, and nits.
* Zero pad to min length like original
* Remove padding mask update for min length (like original)
* Refactor, and update padding mask.
* revert mask update, hurts AMI performance
* feature extractor nits
* Renaming with hf suffix.
* address comments
* Use common util for floats_list
* Prepare for new checkpoints.
* Processor tests, loading fix.
* Rename file according to others.
* shorter file for tests
* leaner processor tests

---------

Co-authored-by: mbtariq82 <mbtariq82@gmail.com>
Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eric Bezzam <4757445+ebezzam@users.noreply.github.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
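
Note on the cache_position entry: one intermediate step in the history above describes building 3D position_ids from cache_position and computing rope_deltas once during prefill (a later entry, "Remove rope deltas.", drops this again). The five listed steps can be sketched roughly as below. This is an illustration in plain Python lists, not the code from this commit: the real model works on tensors, and `RopeState`, `build_position_ids`, `past_seen_tokens` and `prefill_deltas` are hypothetical names introduced here. Treating the three position axes as identical is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RopeState:
    """Hypothetical holder for the per-sequence RoPE offsets cached at prefill."""
    rope_deltas: Optional[List[int]] = None


def build_position_ids(
    batch_size: int,
    seq_length: int,
    past_seen_tokens: int,
    state: RopeState,
    prefill_deltas: Optional[List[int]] = None,
    cache_position: Optional[List[int]] = None,
) -> List[List[List[int]]]:
    """Return position ids of shape (3, batch_size, seq_length) as nested lists."""
    # Initialize cache_position when absent: absolute slots of the new tokens.
    if cache_position is None:
        cache_position = list(range(past_seen_tokens, past_seen_tokens + seq_length))

    # Compute rope_deltas once (prefill); later decode steps reuse the stored value.
    if state.rope_deltas is None:
        if prefill_deltas is None:
            prefill_deltas = [0] * batch_size  # assumption: no offset by default
        state.rope_deltas = list(prefill_deltas)

    # Build 3D position ids: cache_position shifted by each sequence's delta,
    # repeated over three axes (assumed identical in this sketch).
    per_batch = [[pos + delta for pos in cache_position] for delta in state.rope_deltas]
    return [[row[:] for row in per_batch] for _ in range(3)]


# Prefill with 3 tokens, then one decode step that reuses the cached deltas.
state = RopeState()
prefill_ids = build_position_ids(2, 3, 0, state, prefill_deltas=[0, -2])
decode_ids = build_position_ids(2, 1, 3, state)
```

Because the positions depend only on cache_position and the cached deltas, the same function serves prefill and decode without consulting the attention mask, which is the property the commit entry associates with static cache generation.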

References

#43838 - Qwen3 ASR and Forced Aligner

Author

mbtariq82

Parents

1ce2a491

transformers 96720392 - Qwen3 ASR and Forced Aligner (#43838)
