transformers
5739726f - fix: Text splitting in the BasicTokenizer (#22280)

Commit
2 years ago
fix: Text splitting in the BasicTokenizer (#22280)

* fix: Apostrophe splitting in the BasicTokenizer for CLIPTokenizer
* account for apostrophe at start of new word
* remove _run_split_on_punc, use re.findall instead
* remove debugging, make style and quality
* use pattern and punc splitting, repo-consistency will fail
* remove commented out debugging
* adds bool args to BasicTokenizer, remove pattern
* do_split_on_punc default True
* clean stray comments and line breaks
* rebase, repo-consistency
* update to just do punctuation split
* add unicode normalizing back
* remove redundant line
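
The commit message mentions a boolean `do_split_on_punc` argument that defaults to True. The following is a minimal standalone sketch of what such a flag could control. It is not the code from this commit: the helper names `is_punctuation` and `split_on_punc` are hypothetical, and the ASCII ranges treated as punctuation are an assumption made for illustration.

```python
import unicodedata
from typing import List


def is_punctuation(char: str) -> bool:
    """Return True for characters this sketch treats as punctuation."""
    cp = ord(char)
    # Assumption: every non-alphanumeric printable ASCII symbol counts as
    # punctuation, including ones like "$" or "^" that Unicode classes as symbols.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    # Otherwise fall back to the Unicode general category (P* = punctuation).
    return unicodedata.category(char).startswith("P")


def split_on_punc(token: str, do_split_on_punc: bool = True) -> List[str]:
    """Split a whitespace-free token so each punctuation mark stands alone.

    With do_split_on_punc=False the token is returned unchanged, which keeps
    contractions such as "don't" in one piece for a later tokenization stage.
    """
    if not do_split_on_punc:
        return [token]

    pieces: List[str] = []
    start_new_word = True
    for char in token:
        if is_punctuation(char):
            # A punctuation mark is its own piece and ends the current word.
            pieces.append(char)
            start_new_word = True
        else:
            if start_new_word:
                pieces.append("")
            pieces[-1] += char
            start_new_word = False
    return pieces


print(split_on_punc("don't"))
print(split_on_punc("don't", do_split_on_punc=False))
```

With the flag left at its default, the apostrophe is separated from the surrounding letters. Passing `do_split_on_punc=False` leaves the token intact, which is the behavior a tokenizer with its own contraction handling would presumably want.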
Author
Connor Henderson