fix: skip `clean_up_tokenization` for BPE tokenizers in `PreTrainedTokenizerFast` (#44915)
* fix: skip clean_up_tokenization for BPE tokenizers
clean_up_tokenization applies BERT-era string replacements (` .` → `.`,
` !` → `!`, etc.) that are destructive for BPE tokenizers where spaces
are encoded as part of the tokens. This adds a guard that skips the
cleanup when the backend model is BPE and emits a one-time warning
(warning_once) suggesting the user set clean_up_tokenization_spaces=False.
Fixes #35175
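The destructive interaction can be sketched in isolation. This is my illustration, not the transformers source: `clean_up_tokenization` approximates the classic BERT-era replacements, and `maybe_clean` is a hypothetical stand-in for the guard this commit adds.

```python
# Sketch (not transformers code) of why the BERT-era cleanup corrupts
# byte-level BPE output, where leading spaces belong to the tokens.

def clean_up_tokenization(out_string: str) -> str:
    """Approximation of the classic replacements (` .` -> `.`, etc.)."""
    for before, after in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(before, after)
    return out_string

# WordPiece-style decode output: the cleanup is a cosmetic fix.
print(clean_up_tokenization("Hello , world !"))   # Hello, world!

# BPE decode output: "a sentence ." is an *exact* roundtrip of the
# user's input, so stripping the space breaks decode(encode(x)) == x.
print(clean_up_tokenization("a sentence ."))      # a sentence.

# Hypothetical shape of the guard: skip cleanup for BPE backends.
def maybe_clean(decoded: str, is_bpe_backend: bool, clean_flag: bool) -> str:
    if clean_flag and not is_bpe_backend:
        return clean_up_tokenization(decoded)
    return decoded
```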
* test: add test for BPE tokenizer skipping clean_up_tokenization
Verifies that BPE tokenizers preserve spaces before punctuation
even when clean_up_tokenization_spaces=True.
* fix: update tests to expect BPE cleanup skip
clean_up_tokenization is always skipped for BPE tokenizers, even when
explicitly requested, because the cleanup is fundamentally wrong for
BPE (it strips legitimate spaces that are part of the token encoding).
Users who need those string replacements can call
clean_up_tokenization() directly.
Updated test_tokenization_utils.py to expect preserved spacing for
GPT-2 (BPE). Added test in test_tokenization_fast.py verifying the
guard works with an explicit True parameter.
* fix: move BPE test to correct class, use clean roundtrip text
Move test_bpe_tokenizer_skips_clean_up_tokenization_spaces to
PreTrainedTokenizationFastTest (which has bytelevel_bpe_model_name).
Update test_clean_up_tokenization_spaces to use normal text without
artificial WordPiece artifacts — BPE roundtrip preserves originals.
* fix: add leading space to test string for ByteLevel BPE prefix
ByteLevel BPE tokenizers prepend a space during encoding. Use
" Hello world." so the roundtrip matches exactly.
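The leading space can be seen in the GPT-2-style byte-to-unicode table. A simplified sketch (my illustration, not the shipped table-building code): the space byte 0x20 is not printable, so it is shifted into the `chr(256+…)` range and shows up as "Ġ" at the start of prefix-space tokens.

```python
# Simplified GPT-2 byte->unicode map: printable bytes map to themselves,
# the remaining bytes are shifted into the chr(256+) range in order.

def bytes_to_unicode():
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100)))
    mapping, shift = {}, 0
    for b in range(256):
        if b in printable:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + shift)   # space (0x20) lands on "Ġ"
            shift += 1
    return mapping

table = bytes_to_unicode()
print(table[0x20])   # Ġ

# With add_prefix_space, the encoder effectively sees " Hello world.",
# so the first token starts with "Ġ" and decode() returns the string
# with a leading space -- hence the test input " Hello world.".
encoded_view = "".join(table[b] for b in " Hello world.".encode("utf-8"))
print(encoded_view)   # ĠHelloĠworld.
```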
* feat: add escape hatch for BPE cleanup override
Add clean_up_tokenization_spaces_even_though_its_wrong_for_bpe flag
so users who rely on the old behavior can opt back in. The warning
message now mentions this flag. Added test for the override path.
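The decision logic with the escape hatch can be sketched as follows. Only the flag name comes from this commit; the function name and shape are illustrative, not the actual implementation.

```python
# Illustrative decode-time decision with the opt-back-in flag.
import warnings

def should_clean_up(
    clean_up_tokenization_spaces: bool,
    backend_is_bpe: bool,
    clean_up_tokenization_spaces_even_though_its_wrong_for_bpe: bool = False,
) -> bool:
    if not clean_up_tokenization_spaces:
        return False
    if (backend_is_bpe
            and not clean_up_tokenization_spaces_even_though_its_wrong_for_bpe):
        warnings.warn(
            "Skipping clean_up_tokenization for a BPE tokenizer; pass "
            "clean_up_tokenization_spaces_even_though_its_wrong_for_bpe=True "
            "to opt back in, or clean_up_tokenization_spaces=False to silence."
        )
        return False
    return True

print(should_clean_up(True, backend_is_bpe=False))  # True
print(should_clean_up(True, backend_is_bpe=True))   # False, with a warning
print(should_clean_up(
    True, True,
    clean_up_tokenization_spaces_even_though_its_wrong_for_bpe=True))  # True
```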
* test: add llama 3 regression test for BPE clean_up_tokenization_spaces skip
Loads the real Meta-Llama-3-8B tokenizer so the test exercises the actual
shipped config (`clean_up_tokenization_spaces=True` + BPE backend) — locks
in the fix against future reverts. Marked @slow so it only runs under
RUN_SLOW=1, consistent with other gated-repo tokenizer tests in this class.