transformers
bbb51c83 - 🔴🔴🔴 fix: skip `clean_up_tokenization` for BPE tokenizers in `PreTrainedTokenizerFast` (#44915)

Commit
19 days ago
🔴🔴🔴 fix: skip `clean_up_tokenization` for BPE tokenizers in `PreTrainedTokenizerFast` (#44915)

* fix: skip clean_up_tokenization for BPE tokenizers

  clean_up_tokenization applies BERT-era string replacements (` .` → `.`, ` !` → `!`, etc.) that are destructive for BPE tokenizers, where spaces are encoded as part of the tokens themselves. This adds a guard that skips the cleanup when the backend model is BPE and emits a warning_once suggesting the user set clean_up_tokenization_spaces=False. Fixes #35175

* test: add test for BPE tokenizer skipping clean_up_tokenization

  Verifies that BPE tokenizers preserve spaces before punctuation even when clean_up_tokenization_spaces=True.

* fix: update tests to expect BPE cleanup skip

  clean_up_tokenization is now always skipped for BPE tokenizers, even when explicitly requested, because the cleanup is fundamentally wrong for BPE (it strips legitimate spaces that are part of the token encoding). Users who need those string replacements can call clean_up_tokenization() directly. Updated test_tokenization_utils.py to expect preserved spacing for GPT-2 (BPE). Added a test in test_tokenization_fast.py verifying the guard holds even with an explicit True parameter.

* fix: move BPE test to correct class, use clean roundtrip text

  Moves test_bpe_tokenizer_skips_clean_up_tokenization_spaces to PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name). Updates test_clean_up_tokenization_spaces to use normal text without artificial WordPiece artifacts, since the BPE roundtrip preserves the original string.

* fix: add leading space to test string for ByteLevel BPE prefix

  ByteLevel BPE tokenizers prepend a space during encoding, so the test uses " Hello world." to make the roundtrip match exactly.

* feat: add escape hatch for BPE cleanup override

  Adds a clean_up_tokenization_spaces_even_though_its_wrong_for_bpe flag so users who rely on the old behavior can opt back in. The warning message now mentions this flag. Added a test for the override path.
* test: add llama 3 regression test for BPE clean_up_tokenization_spaces skip

  Loads the real Meta-Llama-3-8B tokenizer so the test exercises the actual shipped config (`clean_up_tokenization_spaces=True` + BPE backend), locking in the fix against future reverts. Marked @slow so it only runs under RUN_SLOW=1, consistent with the other gated-repo tokenizer tests in this class.
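The core problem the commit describes can be sketched with a minimal, illustrative reimplementation of the BERT-era replacements (this is not the transformers source, and the replacement list here is abbreviated): the same cleanup that correctly re-attaches detached punctuation in WordPiece-style decode output corrupts BPE output, where a space before punctuation can be a legitimate part of the encoded text.

```python
# Illustrative sketch (assumed, abbreviated): BERT-era cleanup replacements
# of the kind clean_up_tokenization applies, joining detached punctuation
# back onto the previous word.
def bert_era_cleanup(text: str) -> str:
    for src, dst in ((" .", "."), (" ,", ","), (" !", "!"),
                     (" ?", "?"), (" 's", "'s"), (" n't", "n't")):
        text = text.replace(src, dst)
    return text

# WordPiece-style decoding emits punctuation as separate tokens, so the
# decoded string contains spurious spaces and re-attaching is the right fix:
print(bert_era_cleanup("hello , world ."))   # "hello, world."

# ByteLevel BPE encodes spaces as part of the tokens, so decoded text is
# already exact; any " ." that survives the roundtrip was in the input.
# The cleanup silently corrupts it:
print(bert_era_cleanup("version 2 . 0 beta"))  # "version 2. 0 beta"
```

This is why the commit skips the cleanup for BPE backends outright rather than merely defaulting it off: there is no input for which the replacements are correct once spaces are token-internal.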