unstructured
e5d08662 - enhancement: memory efficient xml partitioning (#1547)

2 years ago
Closes #1236.

Partitions XML documents iteratively in most cases*, never loading the entire tree into memory. This ends up being much faster.

(* The exception is when the argument `xml_path` is passed to filter elements. I was not able to find a way in Python to compare XPaths while streaming the elements, aside from writing a custom XPath parser. So the shortest way forward was to bite the bullet and load the whole tree into memory when filtering by XPath.)

Memory usage is about 20% of the usage on `main` when processing a 470MB XML file. Time to process is 10s vs. 900s.

Output is slightly different, but appears to be an improvement: it adds lines of text that the current partitioning skips. No text is lost.
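
For readers unfamiliar with the two strategies described above, here is a minimal sketch of the difference. It uses the standard library's `xml.etree.ElementTree` and is not taken from this commit; the function names `iter_text_streaming` and `iter_text_filtered` are hypothetical, and the actual partitioner may use a different parser and handle nested text and tails differently.

```python
import io
import xml.etree.ElementTree as ET


def iter_text_streaming(source):
    """Yield non-empty element text while parsing incrementally.

    Hypothetical helper: each element is cleared once it has been
    closed, so its text and children can be garbage-collected.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        text = (elem.text or "").strip()
        if text:
            yield text
        # Drop the element's text, attributes, and children to keep
        # memory use low on large files.
        elem.clear()


def iter_text_filtered(source, path):
    """Yield non-empty text of elements matching `path`.

    Hypothetical helper: this loads the whole tree first, mirroring the
    fallback described above for XPath filtering.
    """
    tree = ET.parse(source)
    for elem in tree.iterfind(path):
        text = (elem.text or "").strip()
        if text:
            yield text


if __name__ == "__main__":
    sample = b"<doc><item>alpha</item><note>skip</note><item>beta</item></doc>"
    print(list(iter_text_streaming(io.BytesIO(sample))))
    print(list(iter_text_filtered(io.BytesIO(sample), ".//item")))
```

Note that with `"end"` events, nested elements are yielded before their parents, so the streaming order can differ from a top-down traversal of a fully loaded tree.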