unstructured
dce53453 - Feat: is extracted only true when text has trivial amount of invisible text (#4128)

Commit
9 days ago
## summary

This PR modifies the logic that decides whether a text is `IsExtracted.TRUE`.

- A text is `IsExtracted.TRUE` when it is embedded text rendered on the page. This is useful downstream for judging the quality of the text against the PDF render.
- The current logic labels any text found by `pdfminer` as `IsExtracted.TRUE`.
- However, a PDF file may be a scanned page that also carries OCR-sourced text in its metadata. Such text is often of poor quality and is always invisible, so it is not rendered on the page.
- The modified logic checks whether a piece of text found by `pdfminer` contains a non-trivial amount of invisible text; if it does not, the text is labeled `IsExtracted.TRUE`.

For example, the newly added test file `pdf-with-ocr-text.pdf` is a scanned page with one line at the bottom that is truly embedded text (text rendered on the page). But `pdfminer` finds more text than just that line, and the other text only roughly matches the content of the scanned page. That text is not embedded text: it is not rendered on the page and is invisible. The updated logic correctly identifies it as not `IsExtracted.TRUE`.

## pitfall of the current solution

This solution does not consider cases where the text color matches the background color, which also makes the text invisible.
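
The decision rule described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Char` record, its `invisible` flag, the `invisible_ratio` and `label_is_extracted` helpers, and the `0.1` threshold are all hypothetical, and `IsExtracted` here is a local stand-in for the library's enum. How invisibility is actually detected is not shown; one plausible source is PDF text rendering mode 3, which the PDF specification defines as neither filled nor stroked.

```python
from dataclasses import dataclass
from enum import Enum


class IsExtracted(Enum):
    # Local stand-in for the library's enum (assumption, not the real class).
    TRUE = "true"
    FALSE = "false"


@dataclass
class Char:
    text: str
    invisible: bool  # hypothetical flag, e.g. set when the render mode is 3


def invisible_ratio(chars):
    """Return the fraction of non-whitespace characters flagged as invisible."""
    counted = [c for c in chars if not c.text.isspace()]
    if not counted:
        # Nothing to measure; this sketch treats empty text as fully visible.
        return 0.0
    return sum(c.invisible for c in counted) / len(counted)


def label_is_extracted(chars, threshold=0.1):
    """Label text TRUE only when its invisible share is trivial.

    The threshold is an illustrative value, not the one used by the PR.
    """
    if invisible_ratio(chars) <= threshold:
        return IsExtracted.TRUE
    return IsExtracted.FALSE


def make_chars(text, invisible):
    """Build a list of Char records that all share one visibility flag."""
    return [Char(ch, invisible) for ch in text]


# A rendered line versus a hidden OCR layer, as in the scanned-page example.
visible_line = make_chars("Rendered footer line", invisible=False)
ocr_layer = make_chars("Hidden OCR text", invisible=True)

print(label_is_extracted(visible_line))
print(label_is_extracted(ocr_layer))
```

A ratio with a threshold, rather than a strict "no invisible characters" check, matches the wording "trivial amount of invisible text": a few stray hidden characters would not flip an otherwise rendered line. As the pitfall section notes, a rule of this kind does not catch text whose color matches the background.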