fix: add EN DASH to UNICODE_BULLETS for clean_bullets (#4186)
# Fix: Add EN DASH to clean_bullets
Fixes #4105
## What's the problem?
When using `clean_bullets()`, the EN DASH character (`–`, `\u2013`)
isn't recognized as a bullet point. This is a problem because some PDFs
use EN DASH as bullet markers.
Currently, users have to call `clean_dashes()` as a workaround, but that
removes *all* EN DASHes in the text, not just the ones at the start of
lines.
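
To illustrate why the workaround is too broad, here is a minimal sketch. `remove_all_dashes` is a hypothetical stand-in, not the library's code, and it assumes the workaround replaces every hyphen or EN DASH with a space:

```python
import re


def remove_all_dashes(text: str) -> str:
    """Hypothetical stand-in for a dash-cleaning workaround.

    Assumption: every hyphen or EN DASH is replaced with a space,
    wherever it appears in the text.
    """
    return re.sub(r"[-\u2013]", " ", text).strip()


# The leading EN DASH is a bullet; the one inside "9–17" is a range separator.
line = "\u2013 Opening hours: 9\u201317"
print(remove_all_dashes(line))  # prints "Opening hours: 9 17"
```

The bullet is gone, but so is the dash in the range, which is the side effect described above.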
## The fix
Added `\u2013` (EN DASH) to the `UNICODE_BULLETS` list in
`unstructured/nlp/patterns.py`. Now `clean_bullets()` handles EN DASH
bullets the same way it handles other bullet characters.
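
The effect can be sketched in a self-contained way. This is not the library's implementation: `EXAMPLE_BULLETS` is an abbreviated, hypothetical stand-in for `UNICODE_BULLETS`, and `strip_leading_bullet` is an illustrative helper that assumes a bullet is only stripped when it is the first character of the text:

```python
import re

# Hypothetical, abbreviated stand-in for the UNICODE_BULLETS list.
# The last entry is the newly added EN DASH.
EXAMPLE_BULLETS = ["\u2022", "\u2023", "\u25E6", "\u2043", "\u2013"]
EXAMPLE_BULLETS_RE = re.compile("|".join(re.escape(b) for b in EXAMPLE_BULLETS))


def strip_leading_bullet(text: str) -> str:
    """Remove one bullet character from the start of text, if present."""
    match = EXAMPLE_BULLETS_RE.match(text)  # match() only matches at position 0
    if match is None:
        return text  # no leading bullet: return the input unchanged
    return text[match.end():].strip()


print(strip_leading_bullet("\u2013 First item"))      # prints "First item"
print(strip_leading_bullet("pages 10\u201320"))       # prints "pages 10–20"
print(strip_leading_bullet("\u2022 Existing bullet")) # prints "Existing bullet"
```

Because the pattern is only matched at the start of the string, an EN DASH in the middle of the text is left alone, unlike with the dash-removal workaround.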
## Testing
Added test cases to verify:
- EN DASH at the start of text is cleaned
- EN DASH in the middle of text is preserved
- Existing bullet types still work
All tests pass.
---
Contribution by Gittensor, see my contribution statistics at
https://gittensor.io/miners/details?githubId=94194147