fix doctype parsing error (#3811)

Commit

1 year ago

fix doctype parsing error (#3811)

- Per [ticket](https://unstructured-ai.atlassian.net/browse/ML-551), there is a bug in the `unstructured` library, in metrics/evaluate.py: it retrieves the wrong file extension (the one from before the conversion to a cct file) for paths like '*.pdf.txt' (see the screenshot below).
- The top example in the screenshot shows the current behavior; the bottom example shows the correct result.

![image](https://github.com/user-attachments/assets/6d82de85-3b54-4e77-a637-28a27fcb279d)

- In addition, I observed that the returned doctypes are not consistent: some are returned as '.*' (with the leading dot) and some without the dot.
- Therefore, I aligned them so that all are output in the same form, '.*'.
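The extension handling described in this message can be illustrated with a minimal sketch. This is not the code from metrics/evaluate.py; the function name `doctype_from_output_path` and the fallback behavior for single-suffix names are assumptions made for illustration. It takes a path such as 'report.pdf.txt', picks the suffix that belongs to the original document rather than the trailing output suffix, and always returns it with a leading dot.

```python
from pathlib import Path


def doctype_from_output_path(filename: str) -> str:
    """Hypothetical helper (not the library's API): return the source
    document type for an output file name, always with a leading dot."""
    # Path.suffixes lists every dotted suffix of the final path component,
    # e.g. 'report.pdf.txt' -> ['.pdf', '.txt'].
    suffixes = Path(filename).suffixes
    if len(suffixes) >= 2:
        # The last suffix is the output format ('.txt'); the one before it
        # is the original document type ('.pdf').
        return suffixes[-2].lower()
    if suffixes:
        # Assumption: with only one suffix, treat it as the doctype.
        return suffixes[-1].lower()
    # No suffix at all: nothing to report.
    return ""


if __name__ == "__main__":
    for name in ["report.pdf.txt", "out/slides.PPTX.json", "notes.txt"]:
        print(name, "->", doctype_from_output_path(name))
```

Returning the dotted form in every branch mirrors the alignment the message describes, so callers never have to handle both 'pdf' and '.pdf'.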

References

#3811 - fix doctype parsing error

Author

tbs17

Parents

4140f625

unstructured 8c58bc57 - fix doctype parsing error (#3811)