51894ddd - allow tokentextsplitters to use model name to select encoder (#2963)

Fixes a bug I was seeing: the `TokenTextSplitter` appeared to be splitting text under the gpt-3.5-turbo token limit, but when firing the prompt off to OpenAI, it would come back with an error that we were over the context limit. gpt-3.5-turbo and gpt-4 use the `cl100k_base` tokenizer, so the counts are always off with the default `gpt2` encoder. It's possible to pass the encoding to the `TokenTextSplitter`, but it's much simpler to pass the model name of the LLM. No more concern about keeping the tokenizer and LLM model in sync :)
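
A minimal sketch of the usage this change enables, assuming `langchain` and `tiktoken` are installed; the chunk sizes below are illustrative, not library defaults:

```python
from langchain.text_splitter import TokenTextSplitter

text = "some long document text... " * 500

# Before: token counts come from the default `gpt2` encoding, which
# undercounts for gpt-3.5-turbo / gpt-4 (they use `cl100k_base`), so
# chunks can exceed the real context limit.
splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)

# After: pass the model name and the matching encoder is looked up
# automatically (via tiktoken), keeping tokenizer and LLM in sync.
splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_text(text)
```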