unstructured
2b44f7df - Feat: patch pdfminer and use rendermode to detect invisible text (#4158)

Commit
34 days ago
This PR updates the logic to detect invisible text:

- the recent `pdfminer` bump (to fix a CVE) disabled the route that used color data to determine whether a piece of text is invisible
- this PR uses a custom PDF interpreter that exposes render mode information for an `LTChar` object, then uses that information to determine whether a piece of text is invisible

Note on ingest test update: The file `Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf` contains invisible white space and line breaks in its text. Post processing cleans those up, but this means the text we get is not 100% identical to the text embedded in the PDF data itself. Moreover, in some other files post processing may not be able to remove all of the extra invisible white space. Both points justify changing the `is_extracted` flag from `True` to `partial` for some of the elements (those from which post processing removed invisible white space).

To check the invisible text in that file, run

```python
from unstructured.partition.pdf_image.pdfminer_processing import process_file_with_pdfminer

layout, _ = process_file_with_pdfminer("Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf")
layout[0].texts
```

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
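As a rough illustration of the idea described above, the following is a minimal sketch and not the code from this PR. It assumes each character carries an integer render mode and relies on the PDF specification's text rendering mode 3 ("neither fill nor stroke") to mean invisible. `CharInfo`, `visible_text`, and `extraction_flag` are hypothetical names introduced only for this example; the real change works on pdfminer `LTChar` objects.

```python
from dataclasses import dataclass

# In the PDF specification, text rendering mode 3 means "neither fill nor
# stroke": the glyph is positioned on the page but never painted.
INVISIBLE_RENDER_MODE = 3


@dataclass
class CharInfo:
    """Hypothetical stand-in for an LTChar that exposes its render mode."""

    text: str
    rendermode: int = 0  # 0 = fill, the PDF default


def visible_text(chars):
    """Join only the characters that are actually painted."""
    return "".join(c.text for c in chars if c.rendermode != INVISIBLE_RENDER_MODE)


def extraction_flag(chars):
    """Assumed rule: "partial" if any invisible character was dropped, else "true"."""
    dropped = any(c.rendermode == INVISIBLE_RENDER_MODE for c in chars)
    return "partial" if dropped else "true"


chars = [
    CharInfo("C"),
    CharInfo("o"),
    CharInfo(" ", rendermode=3),  # invisible white space inside the word
    CharInfo("r"),
    CharInfo("e"),
]
print(visible_text(chars))
print(extraction_flag(chars))
```

The flag rule here is a simplification of the behavior the note describes: once invisible characters have been stripped, the resulting text no longer matches the embedded text exactly, so the element is marked `partial`.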