unstructured
8f683e5c - fix: add EN DASH to UNICODE_BULLETS for clean_bullets (#4186)

Commit
29 days ago
# Fix: Add EN DASH to clean_bullets

Fixes #4105

## What's the problem?

When using `clean_bullets()`, the EN DASH character (`–`, `\u2013`) isn't recognized as a bullet point. This is a problem because some PDFs use EN DASH as bullet markers.

Currently, users have to call `clean_dashes()` as a workaround, but that removes *all* EN DASHes in the text, not just the ones at the start of lines.

## The fix

Added `\u2013` (EN DASH) to the `UNICODE_BULLETS` list in `unstructured/nlp/patterns.py`. Now `clean_bullets()` handles EN DASH bullets the same way it handles other bullet characters.

## Testing

Added test cases to verify:

- EN DASH at the start of text is cleaned
- EN DASH in the middle of text is preserved
- Existing bullet types still work

All tests pass.

---

Contribution by Gittensor, see my contribution statistics at https://gittensor.io/miners/details?githubId=94194147
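## Illustration

The behavior described above can be sketched without the library. This is a minimal stand-in, not the code in `unstructured`: `SKETCH_BULLETS` and `strip_leading_bullet` are hypothetical names, and the bullet list is abbreviated. It only shows why adding `\u2013` to the bullet list affects a leading EN DASH while leaving one in the middle of the text alone.

```python
import re

# Hypothetical, abbreviated stand-in for the library's UNICODE_BULLETS list.
# The last entry is EN DASH, the character this commit adds.
SKETCH_BULLETS = ["\u2022", "\u25E6", "\u2043", "\u2013"]

# Match one bullet character at the very start of the string,
# plus any whitespace that follows it.
_LEADING_BULLET = re.compile(
    "^(?:" + "|".join(re.escape(b) for b in SKETCH_BULLETS) + r")\s*"
)


def strip_leading_bullet(text: str) -> str:
    """Remove a single leading bullet; return the text unchanged otherwise."""
    if _LEADING_BULLET.match(text) is None:
        return text
    return _LEADING_BULLET.sub("", text, count=1).strip()


if __name__ == "__main__":
    print(strip_leading_bullet("\u2013 An item"))               # leading EN DASH
    print(strip_leading_bullet("Pages 10\u201320 are missing"))  # EN DASH mid-text
    print(strip_leading_bullet("\u2022 An item"))               # existing bullet
```

Because the pattern is anchored at the start of the string, a range such as `10–20` is left intact, which is the difference from the `clean_dashes()` workaround.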