Universal Speculative Decoding `CandidateGenerator` (#35029)
* move `TestAssistedCandidateGeneratorDifferentTokenizers` into a new testing file
* refactor
* NOTHING. add space to rerun GitHub Actions tests
* remove it...
* `UniversalSpeculativeDecodingGenerator`
* Use `UniversalSpeculativeDecodingGenerator` when `generation_config.do_sample=True`
* assistant tokenizes only the target's new suffix
* formatting
* fix code
* fix code
* formatting
* add `TestGenerateWithDifferentModels`
* `TestGenerateWithDifferentModels` parameterize on `do_sample`
* `AssistantVocabMapping` & `AssistantVocabMappingCache`
* formatting
* `AssistantToTargetTranslator`: `get_target_input_ids` & `get_target_logits`
* improve `_get_assistant_to_target_input_ids` & formatting
* renaming
* WIP: debugging `min_new_tokens`
* fix get_target_ids
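The `AssistantToTargetTranslator` commits above build a mapping between the two models' vocabularies. A minimal sketch of the idea, assuming both vocabularies are `{token_string: id}` dicts as returned by `tokenizer.get_vocab()` (the constants, plain-Python lists, and method bodies here are illustrative, not the actual transformers implementation):

```python
class AssistantToTargetTranslator:
    """Simplified sketch: translate assistant-model token ids and logits
    into the target model's vocabulary by matching token strings."""

    FILTER_VALUE = float("-inf")  # logit for tokens the assistant cannot propose
    SUPPRESS_TOKENS_ID = -1       # id used when a token has no target equivalent

    def __init__(self, assistant_vocab, target_vocab):
        self._assistant_to_target = {
            a_id: target_vocab.get(tok, self.SUPPRESS_TOKENS_ID)
            for tok, a_id in assistant_vocab.items()
        }

    def get_target_input_ids(self, assistant_ids):
        # Translate candidate ids produced by the assistant into target ids.
        return [self._assistant_to_target.get(i, self.SUPPRESS_TOKENS_ID)
                for i in assistant_ids]

    def get_target_logits(self, assistant_logits, target_vocab_size):
        # Scatter assistant logits into the target vocab space; positions
        # with no mapping keep FILTER_VALUE so they are never sampled.
        target_logits = [self.FILTER_VALUE] * target_vocab_size
        for a_id, logit in enumerate(assistant_logits):
            t_id = self._assistant_to_target.get(a_id, self.SUPPRESS_TOKENS_ID)
            if t_id != self.SUPPRESS_TOKENS_ID:
                target_logits[t_id] = logit
        return target_logits
```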
* fix device issue
* fix get_assistant_input_ids
* add `TestAssistedCandidateGeneratorDifferentTokenizers`
* formatting
* `AssistantVocabTranslatorCache` refactor & tests
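The `AssistantVocabTranslatorCache` refactor memoizes translators so the expensive vocab mapping is built only once per tokenizer pair. A hypothetical sketch (the real class may key differently, e.g. via weak references):

```python
class AssistantVocabTranslatorCache:
    """Sketch: cache one translator per (target tokenizer, assistant
    tokenizer) pair so the vocabulary mapping is computed only once."""

    _cache = {}

    @classmethod
    def get_translator(cls, target_tokenizer, assistant_tokenizer, build):
        # Key on object identity of the two tokenizers; `build` constructs
        # the translator on a cache miss.
        key = (id(target_tokenizer), id(assistant_tokenizer))
        if key not in cls._cache:
            cls._cache[key] = build(target_tokenizer, assistant_tokenizer)
        return cls._cache[key]
```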
* revert changes in `src/transformers/generation/logits_process.py`
* refactor `AssistedCandidateGenerator`
* refactor `AssistedCandidateGeneratorDifferentTokenizers`
* formatting
* refactor `UniversalSpeculativeDecodingGenerator`
* fix negative value for max_new_tokens
* fix generation length: target + attention_mask vs. assistant + attention_mask
* fix device
* fix negative max_new_tokens bug
* fix UAG
* minor
* formatting
* initialize `lookbehind`s in `AssistedCandidateGeneratorDifferentTokenizers`
* resolve conflict & formatting
* rerun CI tests
* remove space...
* remove old code
* fix candidate_input_ids device
* minor
* formatting
* Fix prepare + apply (#7)
* fix prepare + apply
* move to cpu
* simplify suppress_tokens
* fix bugs and refactoring
* device move
* handle self.config.vocab_size > len(target_tokenizer.get_vocab())
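Handling `self.config.vocab_size > len(target_tokenizer.get_vocab())` covers models whose embedding matrix is padded beyond the tokenizer's vocabulary. A hedged sketch of one way to do it (the function name and list-based logits are illustrative only):

```python
def expand_logits(logits, model_vocab_size, filter_value=float("-inf")):
    # Some models report a config.vocab_size larger than the tokenizer's
    # vocabulary (padded embeddings). Size the translated logits to the
    # model dimension and fill the extra slots with filter_value so those
    # padding positions can never be sampled.
    return logits + [filter_value] * (model_vocab_size - len(logits))
```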
* no need to normalize in candidate_generator
* address Nadav's comments + minor
* optimize device move + SuppressTokensLogitsProcessor
* AssistantToTargetTranslator, SuppressTokensLogitsProcessor and tokenizers mapping improvements
* padding size
* padding improvement
* fix and simplify get_target_logits
* renaming in get_target_logits
* minor
* add filter_value and suppress_tokens_id
* style + rename
* remove TODO
* restore original SelectTokensLogitsProcessor with modification
* fix style
* fix _update_past_and_masks and optimize code
* remove assistant_vocab_size arg
* fix attention_mask
* call _prepare_attention_mask also if not has_past_key_values
* handling attention mask for first generation
* comment
* restore test
* remove SelectTokensLogitsProcessor
* _update_past_and_masks implementation for USD
* Add unit tests for Universal Assisted Generation
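Universal assisted generation reuses the standard assisted-decoding verification step: the target model scores the translated candidate tokens and keeps the longest matching prefix. A minimal sketch of that check (a simplification; the actual acceptance logic also handles sampling):

```python
def n_matches(candidate_ids, target_ids):
    # Count how many candidate tokens agree with the tokens the target
    # model would have produced; everything after the first mismatch
    # is discarded and regenerated by the target model.
    n = 0
    for c, t in zip(candidate_ids, target_ids):
        if c != t:
            break
        n += 1
    return n
```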
* fix style
* update tests
* Remove unused import and fix `test_speculation_depth` test
* exclude special and reserved tokens from tokenizer for UAG
* mv `test_universal_assisted_generation.py` to `generation/test_candidate_generator.py`
* Remove unused imports and fix style using `make style` (#9)
* formatting
* Swap gated `meta-llama/llama-3.2` with `allenai/llama` (#10)
* Fix space sign disagreement (#12)
* default values for AssistantToTargetTranslator fields
* fix space sign
* minor
* fix test + style
* Default values for some fields of assistant to target translator (#11)
* default values for AssistantToTargetTranslator fields
* fix
* add support to empty logit_processors
* Update candidate_generator.py (#15)
fix typo
* BUG fix in _prepare_assistant_input_ids (#14)
* fix _prepare_assistant_input_ids
* target_to_assistant_input_ids
* Update src/transformers/generation/candidate_generator.py
Co-authored-by: Nadav Timor <nadav.timor@weizmann.ac.il>
---------
Co-authored-by: Nadav Timor <nadav.timor@weizmann.ac.il>
* typo (`target_to_assistant_input_ids`)
* formatting
* merge upstream/main
* Fix minor review comments (#16)
* Fix: `token_ids.to(torch.int64)` (#18)
* tok ids to `torch.int64` (reference: https://huggingface.co/docs/transformers.js/en/api/tokenizers)
* `LongTensor`
* fix dtype
* `assistant_input_ids.to(dtype=torch.long)`
* Remove unused import from test_candidate_generator.py
* Remove `numpy` import
* resolve pr comments (#19)
* `AssistantToTargetTranslator` docstring
* (per gante's comment) `filter_value` and `suppress_tokens_id` to class constants
* update `AssistantToTargetTranslator` docstring
* (gante's comment) replace `match-case`
* formatting
* Fix Joao's comments (#21)
* remove threading
* fix logits_processor
* fix test device
* fix style (#23)
* Move atm (#24)
* move AssistantToTargetTranslator
* fixup
* fix logit_processor
* add atm_translator test
* refactor test
* remove threading from test
* add require_torch in tests
* move AssistantVocabTranslatorCache + add tests
* ruff fix
---------
Co-authored-by: jmamou <jonathan.mamou@intel.com>
Co-authored-by: Gaurav <gauravj@d-matrix.ai>
Co-authored-by: Gaurav Jain <gaurjain14@gmail.com>
Co-authored-by: gauravjain14 <41287729+gauravjain14@users.noreply.github.com>