unstructured
b1b8eae3 - fix(doc): fix disk-space leak (#3019)

Commit
1 year ago
fix(doc): fix disk-space leak (#3019)

**Summary**
Remedy a disk-space leak where `partition_doc()` would leave a copy of each `.doc` file passed as a file-like object on disk.

**Additional Context**
`partition_doc()` creates a temporary file to which it writes each source document provided as a file-like object. This file is never deleted, so disk consumption grows without bound.

The `convert_office_doc()` function used for the DOC->DOCX conversion runs a command-line program provided with LibreOffice. Because that program runs in a separate process with its own memory space, the source file cannot be passed as an in-memory object and must be on the filesystem. When the DOC file is passed as a file-like object, it is therefore written to disk so the conversion program can read it, but it is not deleted afterward.

Fix this by writing the temporary source DOC file to the `TemporaryDirectory` already being used for the conversion-target DOCX file. That directory is automatically removed when `partition_doc()` completes.
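The approach described above can be illustrated with a minimal sketch. This is not the actual `partition_doc()` code: `convert_file_like_doc` and `fake_convert` are hypothetical names, and `fake_convert` merely copies the file as a stand-in for the LibreOffice-based `convert_office_doc()`, whose real signature is not shown here. The point is only that the source and target files share one `TemporaryDirectory`, so both are removed together.

```python
import io
import os
import shutil
import tempfile
from typing import IO, Callable


def fake_convert(source_path: str, target_dir: str) -> str:
    """Hypothetical stand-in for the external DOC->DOCX converter.

    It only copies the file; the real conversion shells out to LibreOffice.
    """
    target_path = os.path.join(target_dir, "document.docx")
    shutil.copyfile(source_path, target_path)
    return target_path


def convert_file_like_doc(
    file: IO[bytes],
    convert: Callable[[str, str], str] = fake_convert,
) -> bytes:
    """Convert a file-like DOC object without leaving files behind."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write the source inside the same temporary directory that
        # receives the conversion target, instead of a separate temp file.
        source_path = os.path.join(tmp_dir, "document.doc")
        with open(source_path, "wb") as f:
            f.write(file.read())

        target_path = convert(source_path, tmp_dir)

        # Read the result before the directory is cleaned up.
        with open(target_path, "rb") as f:
            return f.read()
    # On exit from the `with` block, tmp_dir and everything in it
    # (source and target alike) has been deleted.


if __name__ == "__main__":
    result = convert_file_like_doc(io.BytesIO(b"example bytes"))
    print(len(result))
```

Because cleanup is tied to the context manager, the temporary files are removed even if the conversion step raises an exception.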