Add VibeVoice Acoustic Tokenizer (#43400)
* Add vibevoice tokenizer files.
* Address style tests.
* Revert to expected outputs previously computed on runner.
* Enable encoder output test.
* Update expected output from runner
* Add note on expected outputs
* Remove code link and improve init.
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Modular
* Same changes to decoder layers.
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* doc nits
* Use decoder_depths for decoder!
* Doc nits
* Nits
* Trim feature extraction for tensor only usage.
* Add cache logic to encoder.
* Nit
* Revert to previous sampling approach.
* Nits
* Better logic for VAE sampling.
* More standard conversion script.
* Revert to sample flag
* Nits
* Docs, cleanup, nits.
* Nit
* Nit
* Skip parallelism
* Shift cache creation to when it's used.
* Updated checkpoint path
---------
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>