unstructured
9d228c7e - feat: calculate metric for percent of text missing (#1701)

Commit

2 years ago

feat: calculate metric for percent of text missing (#1701)

### Summary
Missing text is a particularly important metric of quality for the Unstructured library because information from the document is not being captured and is therefore not usable by downstream applications. Add function to calculate the percent of text missing relative to the source transcription. Function takes 2 text strings (output and source) as input, and returns the percentage of text missing as a decimal.

### Technical Details
- The 2 input strings are both assumed to already contain clean and concatenated text (CCT)
- Implementation compares the bags of words (frequency counts for each word present in the text) of each input text
- Duplicated/extra text is not penalized
- Value is limited to the range [0, 1]

### Test
- Several edge cases are covered in the test function (missing text, duplicated text, spaced out words, etc).
- Can test other cases or text inputs by calling the function with 2 CCT strings as "output" and "source"
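The bag-of-words comparison described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the function name `percent_missing_text_sketch` is hypothetical, tokenization is assumed to be plain whitespace splitting (so the "spaced out words" edge case mentioned in the tests is not handled here), and returning 0.0 for an empty source is an assumption.

```python
from collections import Counter


def percent_missing_text_sketch(output: str, source: str) -> float:
    """Fraction of source words that do not appear in output (hypothetical sketch).

    Both inputs are assumed to be clean and concatenated text (CCT).
    """
    # Assumption: a "word" is any whitespace-delimited token.
    output_bow = Counter(output.split())
    source_bow = Counter(source.split())

    total_source_words = sum(source_bow.values())
    if total_source_words == 0:
        # Assumption: nothing can be missing from an empty source.
        return 0.0

    # Counter subtraction drops zero and negative counts, so words that are
    # duplicated or extra in the output are not penalized.
    missing = source_bow - output_bow
    fraction = sum(missing.values()) / total_source_words

    # Keep the result in the range [0, 1].
    return min(max(fraction, 0.0), 1.0)


if __name__ == "__main__":
    print(percent_missing_text_sketch("the quick fox", "the quick brown fox"))
```

In this sketch, one of the four source words ("brown") is absent from the output, so the call in the `__main__` block yields a fraction of 0.25. Repeating a word in the output leaves the result unchanged, which mirrors the "duplicated/extra text is not penalized" rule.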

References

#1701 - feat: calculate metric for percent of text missing

Author

shreyanid

Parents

e597ec7a
