# Token-Based Chunking Support (#4203)
## Summary
This PR adds token-based chunking support to `chunk_by_title()` and
`chunk_elements()` using `tiktoken`, allowing users to specify
`max_tokens` instead of `max_characters` for better alignment with LLM
token limits.

Closes #4127
## Changes
### New Parameters
| Parameter | Description |
|-----------|-------------|
| `max_tokens` | Hard maximum chunk size in tokens (mutually exclusive with `max_characters`) |
| `new_after_n_tokens` | Soft maximum: start a new chunk after this many tokens |
| `tokenizer` | Tokenizer name; accepts encoding names (`"cl100k_base"`) or model names (`"gpt-4"`) |
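The hard/soft distinction can be illustrated with a small grouping helper. This is a hypothetical sketch, not code from this PR: `assign_chunks` and its list-of-counts input are illustrative stand-ins for real elements, and the hard-limit splitting of a single oversized element is omitted.

```python
from typing import List


def assign_chunks(element_sizes: List[int], soft_max: int) -> List[List[int]]:
    """Group per-element token counts into chunks: a chunk closes once
    adding the next element would push the running total past the soft
    limit. A hard limit (max_tokens) would additionally split any single
    oversized element; that splitting is omitted here."""
    chunks: List[List[int]] = []
    current: List[int] = []
    total = 0
    for size in element_sizes:
        if current and total + size > soft_max:
            chunks.append(current)  # close the chunk at the soft boundary
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        chunks.append(current)
    return chunks
```

With a soft limit of 400 tokens, elements of 100, 200, 150, and 300 tokens group as `[100, 200]`, `[150]`, `[300]`: each chunk closes as soon as the next element would exceed the limit.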
### Implementation Details
- **`TokenCounter` class**: Lazy `tiktoken` integration; the library is
  imported only when token counting is first used
- **Measurement abstraction**: Added a `measure()` method to
  `ChunkingOptions` that returns a character or token count depending on
  mode
- **Mutual exclusivity**: `max_tokens` and `max_characters` cannot be
  used together
- **Token-based text splitting**: New `_split_by_tokens()` method
  prefers natural separators, with a binary-search fallback
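A minimal sketch of how the lazy counter and the `measure()` abstraction might fit together. Only the names `TokenCounter`, `ChunkingOptions`, `measure()`, and the parameter names come from this PR; the class shapes and fallback logic are illustrative. Character-mode users never trigger the `tiktoken` import:

```python
from __future__ import annotations


class TokenCounter:
    """Counts tokens with tiktoken, importing it only on first use."""

    def __init__(self, tokenizer: str = "cl100k_base") -> None:
        self._tokenizer_name = tokenizer
        self._encoding = None  # populated lazily on first count()

    def count(self, text: str) -> int:
        if self._encoding is None:
            import tiktoken  # deferred so char-mode users never need it

            try:
                self._encoding = tiktoken.get_encoding(self._tokenizer_name)
            except ValueError:
                # Name may be a model name like "gpt-4" rather than an encoding.
                self._encoding = tiktoken.encoding_for_model(self._tokenizer_name)
        return len(self._encoding.encode(text))


class ChunkingOptions:
    """Illustrative stand-in showing the measure() abstraction."""

    def __init__(self, max_characters=None, max_tokens=None, tokenizer="cl100k_base"):
        if max_characters is not None and max_tokens is not None:
            raise ValueError("max_characters and max_tokens are mutually exclusive")
        self._by_tokens = max_tokens is not None
        self._counter = TokenCounter(tokenizer) if self._by_tokens else None

    def measure(self, text: str) -> int:
        # Chars or tokens, depending on which limit the caller chose.
        return self._counter.count(text) if self._by_tokens else len(text)
```

Downstream chunking logic then compares `measure(text)` against a single limit without caring which unit is in play.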
### Files Changed
- `requirements/extra-chunking-tokens.in` - New tiktoken dependency
- `setup.py` - Added `chunking-tokens` extra
- `unstructured/chunking/base.py` - Core token-based chunking logic
- `unstructured/chunking/title.py` - Updated `chunk_by_title()`
signature
- `unstructured/chunking/basic.py` - Updated `chunk_elements()`
signature
- `test_unstructured/chunking/test_base.py` - Unit tests
- `test_unstructured/chunking/test_title.py` - Integration tests
## Usage
```python
from unstructured.chunking.title import chunk_by_title

# Token-based chunking (new)
chunks = chunk_by_title(
    elements,
    max_tokens=512,
    new_after_n_tokens=400,
    tokenizer="gpt-4",  # or "cl100k_base"
)

# Character-based chunking (unchanged)
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=1000,
)
```
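The separator-preference-with-binary-search splitting described under Implementation Details could look roughly like this. Only the `_split_by_tokens()` name and the binary-search idea come from this PR; the standalone function below is a hedged sketch, and a generic `count` callable stands in for the token counter so the example runs without `tiktoken`:

```python
from typing import Callable, List


def split_by_tokens(text: str, max_units: int, count: Callable[[str], int]) -> List[str]:
    """Split text so each piece measures at most max_units, preferring
    natural separator boundaries and falling back to a binary-searched cut."""
    pieces: List[str] = []
    remainder = text
    while count(remainder) > max_units:
        # Binary search the longest prefix that still fits the limit.
        lo, hi = 1, len(remainder) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if count(remainder[:mid]) <= max_units:
                lo = mid
            else:
                hi = mid - 1
        cut = lo
        # Prefer a paragraph, line, or word boundary at or before the cut.
        for sep in ("\n\n", "\n", " "):
            at = remainder.rfind(sep, 1, cut)
            if at != -1:
                pieces.append(remainder[:at])
                remainder = remainder[at + len(sep):]
                break
        else:
            # No separator found: fall back to the binary-search cut.
            pieces.append(remainder[:cut])
            remainder = remainder[cut:]
    if remainder:
        pieces.append(remainder)
    return pieces
```

Using `len` as the counter, `split_by_tokens("hello world foo bar", 10, len)` yields `["hello", "world foo", "bar"]`: each piece fits the limit and cuts land on word boundaries where possible.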
## Installation
To use token-based chunking, install with the new extra:
```bash
pip install "unstructured[chunking-tokens]"
```