Feat: patch pdfminer and use rendermode to detect invisible text (#4158)
This PR updates the logic to detect invisible text:
- a recent `pdfminer` bump (to fix a CVE) disabled the route that used
color data to determine whether a piece of text is invisible
- this PR uses a custom PDF interpreter that exposes render mode
information on each `LTChar` object, then uses that to determine whether
a piece of text is invisible
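
As a rough illustration of the render mode idea (not the code in this PR): in the PDF spec, text rendering mode 3 means the glyphs are neither filled nor stroked, so they are invisible. The `Char` class and its `rendermode` attribute below are hypothetical stand-ins for an `LTChar` that carries its render mode; the real attribute name may differ.

```python
from dataclasses import dataclass
from typing import Iterable

# PDF text rendering mode 3: neither fill nor stroke, i.e. invisible text.
INVISIBLE_RENDER_MODE = 3


@dataclass
class Char:
    """Hypothetical stand-in for an LTChar that exposes its render mode."""

    text: str
    rendermode: int = 0  # 0 is the PDF default (fill text)


def is_invisible(char: Char) -> bool:
    """Return True when the character is drawn with the invisible render mode."""
    return char.rendermode == INVISIBLE_RENDER_MODE


def visible_text(chars: Iterable[Char]) -> str:
    """Join only the characters that would actually be painted on the page."""
    return "".join(c.text for c in chars if not is_invisible(c))


chars = [Char("a"), Char(" ", rendermode=3), Char("b")]
print(visible_text(chars))
```

Unlike the color-based route, this check does not rely on fill or stroke color data at all.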
Note on ingest test update:
The file `Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf`
contains invisible whitespace and line breaks in its text. Post-processing
cleans these up, but it means the text we get is not 100% the unchanged
embedded text from the PDF data itself. Moreover, in some other files
post-processing may not be able to remove all of the extra invisible
whitespace. Both points justify changing the `is_extracted` flag from
`True` to `partial` for some of the elements (those from which
post-processing removed invisible whitespace).
To check the invisible text in that file, run:
```python
from unstructured.partition.pdf_image.pdfminer_processing import process_file_with_pdfminer
layout, _ = process_file_with_pdfminer("Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf")
layout[0].texts
```
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>