unstructured
4e61acc1 - fix(file): fix OLE-based file-type auto-detection (#3437)

Commit
1 year ago
fix(file): fix OLE-based file-type auto-detection (#3437) **Summary** A DOC, PPT, or XLS file sent to partition() as a file-like object is misidentified as a MSG file and raises an exception in python-oxmsg (which is used to process MSG files). **Fix** DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound File Binary Format (CFBF). These can be reliably distinguished by inspecting magic bytes in certain locations. `libmagic` is unreliable at this or doesn't try, reporting the generic `"application/x-ole-storage"` which corresponds to the "container" CFBF format (vaguely like a Microsoft Zip format) that all these document types are stored in. Unconditionally use `filetype.guess_mime()` provided by the `filetype` package that is part of the base unstructured install. Unlike `libmagic`, this package reliably detects the distinguished MIME-type (e.g. `"application/msword"`) for OLE file subtypes. Fixes #3364
Author
Parents
Loading