Feat: is extracted only true when text has trivial amount of invisible text (#4128)
## summary
This PR modifies the logic that detects if a text is `IsExtracted.TRUE`.
- a text is `IsExtracted.TRUE` when it is embedded text rendered on the
page; this is useful downstream for judging the quality of the text
against the PDF render
- current logic labels any text found by `pdfminer` as
`IsExtracted.TRUE`
- however, a PDF file may be a scanned page that also contains
OCR-sourced text in its metadata. Such text is often of poor quality
and is always invisible, so it is not rendered on the page
- the modified logic checks whether a piece of text found by `pdfminer`
contains a non-trivial amount of invisible text; if it does not, the
text is labeled `IsExtracted.TRUE`
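
The rule in the last bullet can be sketched roughly as follows. This is
an illustration, not the code in this PR: the `Char` record, the
`label_is_extracted` helper, the two-member `IsExtracted` stand-in, and
the `max_invisible_ratio` threshold are all hypothetical names chosen
for the example. It also assumes that invisibility is signaled by PDF
text rendering mode 3 (neither fill nor stroke), which is one common way
OCR layers are hidden; the actual detection may differ.

```python
from dataclasses import dataclass
from enum import Enum

# PDF text rendering mode 3 means "neither fill nor stroke", i.e. invisible.
INVISIBLE_RENDER_MODE = 3


class IsExtracted(Enum):
    """Simplified stand-in for the real enum (assumption: it may have more members)."""

    TRUE = "true"
    FALSE = "false"


@dataclass
class Char:
    """Hypothetical per-character record: the glyph and its rendering mode."""

    text: str
    render_mode: int


def label_is_extracted(chars, max_invisible_ratio=0.1):
    """Label text as extracted only if its share of invisible characters is trivial.

    `max_invisible_ratio` is an assumed threshold, not a value taken from the PR.
    """
    # Whitespace carries no visual information, so leave it out of the ratio.
    glyphs = [c for c in chars if not c.text.isspace()]
    if not glyphs:
        return IsExtracted.FALSE
    invisible = sum(1 for c in glyphs if c.render_mode == INVISIBLE_RENDER_MODE)
    ratio = invisible / len(glyphs)
    return IsExtracted.TRUE if ratio <= max_invisible_ratio else IsExtracted.FALSE


# Usage: a visible footer line versus a hidden OCR layer.
footer = [Char(ch, 0) for ch in "Page 1 of 3"]
ocr_layer = [Char(ch, INVISIBLE_RENDER_MODE) for ch in "roughly matching text"]
print(label_is_extracted(footer))
print(label_is_extracted(ocr_layer))
```

Using a ratio rather than an all-or-nothing check tolerates a few stray
invisible characters inside otherwise visible text, which is the sense
of "trivial amount" above.
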
For example, the newly added test file `pdf-with-ocr-text.pdf` is a
scanned page with one line at the bottom that is truly embedded text
(text rendered on the page). But `pdfminer` finds more text than just
that line, and the other text only roughly matches the content of the
scanned page. That text is not embedded text: it is invisible and not
rendered on the page. The updated logic correctly identifies it as not
`IsExtracted.TRUE`.
## pitfall of the current solution
This solution does not handle cases where the text color matches the
background color, which also makes the text invisible.