Add Parakeet (#39062)
* first commit
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
* update to handle masking for bs>1
* Add tests and docs
* update model ids
* update docs and improve style
* update librosa location
* import guard torch too
* ruff code checks fix
* ruff format check
* updated to parakeet names
* update script
* Add tokenizer decoding
* Remove other model dependency
* clean tests
* fix tests
* linting
* fix ruff lint warnings
* move to separate folders
* add parakeet ctc model code
* simplify encoder structure
* update documentation
* add parakeet to toctree
* fix tests
* add parakeet doc
* Address comments
* Update featurizer to compute lens directly
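A minimal sketch of what "compute lens directly" means here: deriving the number of mel-spectrogram frames straight from each raw waveform length, instead of materializing and summing a full time mask. The window/hop values below are illustrative defaults, not the actual Parakeet featurizer parameters.

```python
# Sketch: frame counts from waveform lengths (hypothetical parameters).
def feature_lengths(waveform_lengths, win_length=400, hop_length=160, center=True):
    """Number of STFT frames produced for each waveform length."""
    out = []
    for n in waveform_lengths:
        if center:
            # a centered STFT pads win_length // 2 samples on both sides
            out.append(1 + n // hop_length)
        else:
            out.append(1 + max(n - win_length, 0) // hop_length)
    return out
```

With a 10 ms hop at 16 kHz, one second of audio yields 101 centered frames.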
* fix ruff tests
* fix encoding format
* fix minor ctc decoding
* revert modular_model_converter.py changes
* revert check_config_attributes.py changes
* refactor: fastconformer & parakeet_ctc -> parakeet
* modeling update
* test update
* propagate feature extractor updates
* propagate doc changes
* propagate doc changes
* propagate tokenization changes
* propagate conversion changes
* remove fastconformer tests
* remove modular
* update processor
* update processor
* test update
* diverse fixes
* 100% matching greedy batched decoding
* Update conversion script.
* Refactor docs.
* Refactor auto loading.
* Refactor and fix tokenization and processing.
* Update integration test.
* Modeling fixes:
- ensure correct attention mask shape
- ensure layer drop returns valid output
- correct blank token ID when computing CTC loss
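The attention-mask fix above is about shape: per-example lengths have to be expanded into an additive mask that broadcasts over the attention score tensor. A framework-agnostic NumPy sketch of that expansion (illustrative, not the exact Parakeet implementation):

```python
import numpy as np

def attention_mask_from_lengths(lengths, max_len=None):
    """Additive padding mask of shape (batch, 1, 1, seq_len).

    Broadcasts over (batch, heads, q_len, k_len) attention scores:
    0.0 at real frames, -inf at padding.
    """
    lengths = np.asarray(lengths)
    if max_len is None:
        max_len = int(lengths.max())
    # True where a position is a real (non-padding) frame
    keep = np.arange(max_len)[None, :] < lengths[:, None]  # (batch, seq_len)
    mask = np.where(keep, 0.0, -np.inf)
    return mask[:, None, None, :]
```

The singleton head and query dimensions are what make the shape "correct": adding this mask to scores masks every query's view of padded keys at once.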
* Format and repo consistency.
* Update model doc.
* Fix feature extraction tests.
* Fix (most) tokenizer tests.
* Add pipeline example.
* Fixes
* Use eager_attention_forward from Llama.
* Small tweaks.
* Replace Sequential with ModuleList
* Add check if not all layers copied
* Clean tokenizer.
* Standardize FastSpeech2ConformerConvolutionModule for Parakeet.
* Switch to modular for modeling and processing.
* Add processor tests.
* Fix modeling tests.
* Formatting and docstrings.
* Add `return_attention_mask` like other feature extractors.
* clean up after merging main.
* nits on modeling
* configuration update
* nit
* simplification: use PreTrainedTokenizerFast, simplify processor
* add dtype arg to mel_filter_bank
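The point of a `dtype` argument here is that the filter bank (and everything computed with it) should match the model's precision up front, instead of casting features afterwards. A self-contained sketch of a triangular mel filter bank builder with that argument, using the HTK mel scale; this is not the actual `transformers` implementation:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filter_bank(num_freq_bins, num_mel_bins, sampling_rate, dtype=np.float32):
    """(num_freq_bins, num_mel_bins) triangular filters, cast to `dtype`."""
    freqs = np.linspace(0.0, sampling_rate / 2, num_freq_bins)
    # band edges equally spaced on the mel scale, mapped back to Hz
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sampling_rate / 2), num_mel_bins + 2)
    hz_points = mel_to_hz(mel_points)
    filters = np.zeros((num_freq_bins, num_mel_bins))
    for m in range(num_mel_bins):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (freqs - left) / (center - left)
        falling = (right - freqs) / (right - center)
        filters[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return filters.astype(dtype)
```

Building in float64 and casting once at the end keeps the edge arithmetic precise even when the requested dtype is half precision.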
* feature extraction: simplify!
* modeling update
* change to ParakeetTokenizerFast
* correct attention mask handling
* auto update
* proc update
* test update
* feature extraction fixes
* modeling update
* conversion script update
* update feature extraction integration tests
* update tokenization and tests
* processor tests
* revert audio_utils
* config docstring update
* blank_token -> pad_token
* modeling update
* doc update
* fix tests
* fix test
* fix tests
* address review comments
* add comment
* add comment
* explicitly do not support flash attention
* attention: straightforward masking
* fix
* tokenizer update: skipping blank tokens by default
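Together with the `blank_token -> pad_token` change above, blank-skipping decoding amounts to standard CTC greedy decoding: take the best token per frame, collapse consecutive repeats, then drop the blank (pad) id. A minimal sketch, with `pad_token_id=0` as an illustrative assumption:

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, pad_token_id=0):
    """Collapse repeated frame predictions, then drop blank (pad) tokens."""
    collapsed = [tok for tok, _ in groupby(frame_ids)]  # merge runs of repeats
    return [tok for tok in collapsed if tok != pad_token_id]
```

Order matters: repeats must be collapsed before blanks are removed, otherwise a genuine doubled token separated by a blank (e.g. "ll") would wrongly merge into one.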
* doc update
* fix max_position_embeddings handling
* nits
* change atol in feature extraction integration tests
* doc update + fix loss
* doc update
* nit
* update integration test for A10
* repo id name
* nit
---------
Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Co-authored-by: Eric B <ebezzam@gmail.com>