# Token-Based Chunking Support (#4203)
## Summary
This PR adds token-based chunking support to `chunk_by_title()` and
`chunk_elements()` using `tiktoken`, allowing users to specify
`max_tokens` instead of `max_characters` for better alignment with LLM
token limits.

Closes #4127
## Changes
### New Parameters
| Parameter | Description |
|-----------|-------------|
| `max_tokens` | Hard maximum chunk size in tokens (mutually exclusive with `max_characters`) |
| `new_after_n_tokens` | Soft maximum: start a new chunk after this many tokens |
| `tokenizer` | Tokenizer name; accepts encoding names (`"cl100k_base"`) or model names (`"gpt-4"`) |
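The hard/soft distinction can be illustrated with a small grouping helper. This is a hypothetical sketch, not code from this PR: `assign_chunks` and its list-of-counts input are illustrative stand-ins for real elements, and the hard-limit splitting of a single oversized element is omitted.

```python
from typing import List


def assign_chunks(element_sizes: List[int], soft_max: int) -> List[List[int]]:
    """Group per-element token counts into chunks: a chunk closes once
    adding the next element would push the running total past the soft
    limit. A hard limit (max_tokens) would additionally split any single
    oversized element; that splitting is omitted here."""
    chunks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for size in element_sizes:
        if current and total + size > soft_max:
            chunks.append(current)  # close the chunk at the soft boundary
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        chunks.append(current)
    return chunks
```

With a soft limit of 400 tokens, elements of 100, 200, 150, and 300 tokens group as `[100, 200]`, `[150]`, `[300]`: each chunk closes as soon as the next element would exceed the limit.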
### Implementation Details
- **`TokenCounter` class**: Lazy `tiktoken` integration; the library is
  imported only when token counting is first used
- **Measurement abstraction**: Added a `measure()` method to
  `ChunkingOptions` that returns a character or token count depending on
  mode
- **Mutual exclusivity**: `max_tokens` and `max_characters` cannot be
  used together
- **Token-based text splitting**: New `_split_by_tokens()` method
  prefers natural separators, with a binary-search fallback
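A minimal sketch of how the lazy counter and the `measure()` abstraction might fit together. Only the names `TokenCounter`, `ChunkingOptions`, `measure()`, and the parameter names come from this PR; the class shapes and fallback logic are illustrative. Character-mode users never trigger the `tiktoken` import:

```python
from __future__ import annotations


class TokenCounter:
    """Counts tokens with tiktoken, importing it only on first use."""

    def __init__(self, tokenizer: str = "cl100k_base") -> None:
        self._tokenizer_name = tokenizer
        self._encoding = None  # populated lazily on first count()

    def count(self, text: str) -> int:
        if self._encoding is None:
            import tiktoken  # deferred so char-mode users never need it

            try:
                self._encoding = tiktoken.get_encoding(self._tokenizer_name)
            except ValueError:
                # Name may be a model name like "gpt-4" rather than an encoding.
                self._encoding = tiktoken.encoding_for_model(self._tokenizer_name)
        return len(self._encoding.encode(text))


class ChunkingOptions:
    """Illustrative stand-in showing the measure() abstraction."""

    def __init__(self, max_characters=None, max_tokens=None, tokenizer="cl100k_base"):
        if max_characters is not None and max_tokens is not None:
            raise ValueError("max_characters and max_tokens are mutually exclusive")
        self._by_tokens = max_tokens is not None
        self._counter = TokenCounter(tokenizer) if self._by_tokens else None

    def measure(self, text: str) -> int:
        # Chars or tokens, depending on which limit the caller chose.
        return self._counter.count(text) if self._by_tokens else len(text)
```

Downstream chunking logic then compares `measure(text)` against a single limit without caring which unit is in play.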
### Files Changed
- `requirements/extra-chunking-tokens.in` - New tiktoken dependency
- `setup.py` - Added `chunking-tokens` extra
- `unstructured/chunking/base.py` - Core token-based chunking logic
- `unstructured/chunking/title.py` - Updated `chunk_by_title()`
signature
- `unstructured/chunking/basic.py` - Updated `chunk_elements()`
signature
- `test_unstructured/chunking/test_base.py` - Unit tests
- `test_unstructured/chunking/test_title.py` - Integration tests
## Usage
```python
from unstructured.chunking.title import chunk_by_title

# Token-based chunking (new)
chunks = chunk_by_title(
    elements,
    max_tokens=512,
    new_after_n_tokens=400,
    tokenizer="gpt-4",  # or "cl100k_base"
)

# Character-based chunking (unchanged)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
)
```
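The separator-preference-with-binary-search splitting described under Implementation Details could look roughly like this. Only the `_split_by_tokens()` name and the binary-search idea come from this PR; the standalone function below is a hedged sketch, and a generic `count` callable stands in for the token counter so the example runs without `tiktoken`:

```python
from typing import Callable, List


def split_by_tokens(text: str, max_units: int, count: Callable[[str], int]) -> List[str]:
    """Split text so each piece measures at most max_units, preferring
    natural separator boundaries and falling back to a binary-searched cut."""
    pieces: List[str] = []
    remainder = text
    while count(remainder) > max_units:
        # Binary search the longest prefix that still fits the limit.
        lo, hi = 1, len(remainder) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if count(remainder[:mid]) <= max_units:
                lo = mid
            else:
                hi = mid - 1
        cut = lo
        # Prefer a paragraph, line, or word boundary at or before the cut.
        for sep in ("\n\n", "\n", " "):
            at = remainder.rfind(sep, 1, cut)
            if at != -1:
                pieces.append(remainder[:at])
                remainder = remainder[at + len(sep):]
                break
        else:
            # No separator found: fall back to the binary-search cut.
            pieces.append(remainder[:cut])
            remainder = remainder[cut:]
    if remainder:
        pieces.append(remainder)
    return pieces
```

Using `len` as the counter, `split_by_tokens("hello world foo bar", 10, len)` yields `["hello", "world foo", "bar"]`: each piece fits the limit and cuts land on word boundaries where possible.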
## Installation
To use token-based chunking, install with the new extra:
```bash
pip install "unstructured[chunking-tokens]"
```