unstructured
8096b5af - fix: remove duplicate characters caused by fake bold rendering in PDFs (#4215)

Commit
42 days ago
Closes #3864

## Summary

- Fixes an issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
- Some PDF generators simulate bold by rendering each character twice at slightly offset positions
- Adds character-level deduplication based on position proximity to detect and remove these duplicates

## Problem

When extracting text from certain PDFs, bold text appears duplicated:

```python
from unstructured.partition.pdf import partition_pdf

# Before fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"
```

## Solution

Added character-level deduplication that:

- Compares the text content and position of consecutive characters
- Removes duplicates where the same character appears within 3 pixels (configurable)
- Preserves spaces and other non-character elements (LTAnno objects)

```python
from unstructured.partition.pdf import partition_pdf

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
```

## Configuration

```bash
# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
```
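
For illustration, the position-proximity check described under Solution might look roughly like the sketch below. This is not the code from the commit: the `Char` dataclass and the `dedupe_fake_bold` function are hypothetical names, positions are simplified to a single `(x, y)` pair, and a position-less item stands in for an LTAnno-style object. Reading the threshold from `PDF_CHAR_DUPLICATE_THRESHOLD` with a default of 3.0 follows the Configuration section, but how the library actually reads it is an assumption here.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Char:
    """Hypothetical stand-in for an extracted glyph (not a pdfminer type)."""
    text: str
    x: Optional[float] = None  # None marks a position-less item, e.g. an inserted space
    y: Optional[float] = None


def dedupe_fake_bold(chars: List[Char], threshold: float = 3.0) -> str:
    """Drop a glyph that repeats the previously kept glyph within `threshold` on both axes."""
    if threshold <= 0:  # assumption: a threshold of 0 disables deduplication
        return "".join(c.text for c in chars)

    kept: List[Char] = []
    for c in chars:
        prev = kept[-1] if kept else None
        is_duplicate = (
            prev is not None
            and c.x is not None and c.y is not None
            and prev.x is not None and prev.y is not None
            and c.text == prev.text
            and abs(c.x - prev.x) < threshold
            and abs(c.y - prev.y) < threshold
        )
        if not is_duplicate:
            kept.append(c)  # position-less items never match, so they are always kept
    return "".join(c.text for c in kept)


# Fake bold: every glyph is drawn twice, 0.5 units apart.
fake_bold = [
    Char("B", 10.0, 100.0), Char("B", 10.5, 100.0),
    Char("O", 18.0, 100.0), Char("O", 18.5, 100.0),
    Char("L", 26.0, 100.0), Char("L", 26.5, 100.0),
    Char("D", 32.0, 100.0), Char("D", 32.5, 100.0),
]
threshold = float(os.environ.get("PDF_CHAR_DUPLICATE_THRESHOLD", "3.0"))
print(dedupe_fake_bold(fake_bold, threshold))
```

Comparing positions as well as text is what keeps genuine double letters intact: the two "o" glyphs in "book" sit a full character width apart, so they fall outside the threshold and both survive.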