unstructured
c6c74628 - fix: set max decompressed size for elements JSON (#4244)

Commit
24 days ago
fix: set max decompressed size for elements JSON (#4244)

Sets a maximum size on the decompressed form of an elements JSON. For reference, a fairly large JSON from a 1225-page document is 5MB. One place we might still run out of headroom is a JSON from a very large document that includes embedded digital images.

If a JSON is too large, the decompressed version will not parse, because its tail is cut off. Part of the review should be to determine whether this is an acceptable failure mode.

---

> [!NOTE]
> **Medium Risk**
> Touches deserialization of compressed element payloads, which can affect ingestion/round-tripping for large documents and changes the failure mode to explicit exceptions when limits are hit.
>
> **Overview**
> Adds a hard cap (`MAX_DECOMPRESSED_SIZE`, default 200MB) when inflating base64+gzipped elements JSON in `elements_from_base64_gzipped_json`, preventing unbounded memory/disk blowups; decompression now explicitly fails with `DecompressedSizeExceededError` (new) when the limit is hit, or `zlib.error` when the payload is incomplete/corrupt.
>
> Bumps version to `0.20.7`, updates the changelog, and adds targeted tests covering normal round-trip, incomplete streams, and size-limit exceedance (via patching the max size).
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit a5e52565cded2d2734d98a2f70eb29f82b90f91d.</sup>
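The capped-inflation behavior described above might look roughly like the following. This is a minimal sketch, not the code from the commit: the helper name `bounded_inflate` is hypothetical, the exception class is redefined locally using the name from the summary, and the approach (a `zlib` decompress object with `max_length`) is an assumption about one reasonable way to enforce the limit.

```python
import base64
import zlib

# 200MB default, per the commit summary; the exact expression is an assumption.
MAX_DECOMPRESSED_SIZE = 200 * 1024 * 1024


class DecompressedSizeExceededError(Exception):
    """Local stand-in for the exception named in the commit summary."""


def bounded_inflate(b64_payload: str, max_size: int = MAX_DECOMPRESSED_SIZE) -> bytes:
    """Hypothetical helper: inflate a base64 payload, refusing oversized output."""
    raw = base64.b64decode(b64_payload)
    # MAX_WBITS | 32 lets zlib auto-detect a zlib or gzip header.
    inflater = zlib.decompressobj(zlib.MAX_WBITS | 32)
    # Ask for one byte more than the cap so an overflow is detectable
    # without ever holding more than max_size + 1 bytes of output.
    data = inflater.decompress(raw, max_size + 1)
    if len(data) > max_size:
        raise DecompressedSizeExceededError(
            f"decompressed payload exceeds {max_size} bytes"
        )
    if not inflater.eof:
        # Input ran out before the end-of-stream marker: truncated payload.
        raise zlib.error("incomplete or truncated compressed stream")
    return data
```

Raising on overflow, rather than returning the truncated bytes, keeps the caller from handing a cut-off JSON string to the parser, which is the failure mode the commit description asks reviewers to weigh.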