fix: remove duplicate characters caused by fake bold rendering in PDFs (#4215)
Closes #3864
## Summary
- Fixes an issue where bold text in PDFs is extracted with duplicated
characters (e.g., "BOLD" → "BBOOLLDD")
- Some PDF generators simulate bold by rendering each character twice at
slightly offset positions
- Added character-level deduplication based on position proximity to
detect and remove these duplicates
## Problem
When extracting text from certain PDFs, bold text appears duplicated:
```python
# Before fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"
```
## Solution
Added character-level deduplication that:
- Compares consecutive characters' text content and position
- Removes a character when the same character appears again within 3 pixels
of the previous one (threshold is configurable)
- Preserves spaces and other non-character elements (LTAnno objects)
```python
# After fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
```
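The comparison described in the list above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: `Char` and `dedupe_chars` are hypothetical names, `Char` stands in for pdfminer's positioned characters, and an item with no coordinates models an `LTAnno`. Treating an unpositioned item as a break between runs is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Char:
    """Hypothetical stand-in for one extracted item."""
    text: str
    x0: Optional[float] = None  # None models an unpositioned item (e.g. LTAnno)
    y0: Optional[float] = None


def dedupe_chars(chars: List[Char], threshold: float = 3.0) -> List[Char]:
    """Drop a character that repeats the previous kept one at nearly the same position."""
    if threshold <= 0:
        # A threshold of 0 disables deduplication entirely.
        return list(chars)

    result: List[Char] = []
    prev: Optional[Char] = None  # last positioned character that was kept
    for ch in chars:
        if ch.x0 is None or ch.y0 is None:
            # Spaces and other unpositioned items are always preserved.
            result.append(ch)
            prev = None  # assumption: an unpositioned item ends the run
            continue
        is_duplicate = (
            prev is not None
            and ch.text == prev.text
            and abs(ch.x0 - prev.x0) <= threshold
            and abs(ch.y0 - prev.y0) <= threshold
        )
        if is_duplicate:
            continue  # second rendering of a fake-bold glyph
        result.append(ch)
        prev = ch
    return result


# Fake bold: every glyph drawn twice, 0.6 units apart.
fake_bold = [
    Char("B", 10.0, 100.0), Char("B", 10.6, 100.0),
    Char("O", 18.0, 100.0), Char("O", 18.6, 100.0),
    Char("L", 26.0, 100.0), Char("L", 26.6, 100.0),
    Char("D", 33.0, 100.0), Char("D", 33.6, 100.0),
]
print("".join(c.text for c in dedupe_chars(fake_bold)))  # BOLD
```

Genuine double letters (the "oo" in "too") survive because their positions differ by roughly a glyph width, which is normally well above the threshold.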
## Configuration
```bash
# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0
# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0
# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
```
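For reference, a minimal sketch of how such a threshold could be read in Python. The helper name `read_char_duplicate_threshold` is hypothetical, and falling back to the default on an empty or non-numeric value is an assumption; the parsing actually used by this PR is not shown here.

```python
import os


def read_char_duplicate_threshold(default: float = 3.0) -> float:
    """Read PDF_CHAR_DUPLICATE_THRESHOLD; a result of 0 means deduplication is disabled."""
    raw = os.environ.get("PDF_CHAR_DUPLICATE_THRESHOLD")
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return default
    # Assumption: negative values are treated as "disabled".
    return max(value, 0.0)
```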