next.js
28df39ba - turbo-persistence: streaming SST writer for reduced memory usage (#90617)

Commit
2 days ago
turbo-persistence: streaming SST writer for reduced memory usage (#90617) ### What? Replace the two-pass SST file write approach with a single-pass `StreamingSstWriter` that writes blocks incrementally as entries arrive. Also replace the compaction `Collector`'s double-buffered `last_entries` pattern with a streaming collector that wraps the new writer. ### Why? The previous SST write path materialized all entries in memory (keys + values), then wrote all value blocks, then all key blocks. During compaction, the `Collector` maintained two full entry vectors (`entries` + `last_entries`) for size balancing. For large SST files (256MB+), this required holding hundreds of MB of entry data in memory simultaneously. The streaming approach reduces peak memory per writer from hundreds of MB to ~200KB by writing blocks to disk as soon as possible rather than buffering everything. ### How? **Phase 1: `StreamingSstWriter` in `static_sorted_file_builder.rs`** The key insight is that the SST reader is **block-index-addressed** — it locates blocks by index via the offset table, not by file position. So interleaving value blocks and key blocks in any order is fully compatible with the reader. The writer processes entries one at a time: - **Medium values** are written to disk immediately as individual blocks - **Small values** accumulate in a buffer and flush when reaching 8KB - **Key blocks** flush incrementally as the resolved boundary advances A `VecDeque<PendingKeyEntry>` holds entries waiting for their value blocks to be written. Each pending entry contains a `ValueRef` that tracks where its value lives: - `Small { block_index, offset, size }` — already resolved - `PendingSmall { small_block_id, offset, size }` — waiting for small block flush - `Medium { block_index }` — already resolved (written immediately) - `Inline`, `Blob`, `Deleted` — no value block needed The `first_pending_small_index` partitions `pending_keys` into resolved entries (front) and unresolved entries (back). Key blocks are flushed **incrementally** via `try_flush_key_blocks()` at the two points where this boundary advances: - When a small value block is flushed — resolves all `PendingSmall` entries, then calls `try_flush_key_blocks()` to flush any complete key blocks from the resolved region - In `advance_boundary_to()` — processes resolved entries and flushes key blocks via `try_flush_key_block()` when they exceed size or entry count limits This makes adding N entries **O(N) total** instead of O(N²) from the earlier per-call scan approach. **Key block flushing**: The `try_flush_key_block()` method combines the check-and-flush logic — it tests whether adding the next entry would exceed `MAX_KEY_BLOCK_SIZE` (16KB) or `MAX_KEY_BLOCK_ENTRIES`, and if so, flushes the accumulated key block. Entries sharing the same key hash are never split across blocks. **AMQF filter**: Created with `max_entry_count` (an upper bound), then shrunk via `shrink_to_fit()` on `close()` to reclaim unused capacity while preserving false positive rates. **Index block writing**: `IndexBlockBuilder` is generic over `Write` and writes directly to the `BufWriter` during `close()`, avoiding an intermediate buffer allocation. Index blocks use `write_raw_block_to_file` since they are never compressed. **File size**: Computed from the last block offset plus the offset table size, avoiding a `stream_position()` call (which would trigger an unnecessary flush + seek syscall). `write_static_stored_file()` is reimplemented as a thin wrapper around `StreamingSstWriter`, so existing callers (`write_batch.rs`, benchmarks) need zero changes. **Phase 2: Streaming compaction in `db.rs`** The compaction `Collector` is simplified to wrap an `Option<(u32, StreamingSstWriter)>`. The old pattern of maintaining `entries` + `last_entries` vectors with swap/early-flush/split logic (~100 lines) is replaced by direct calls to `writer.add()`. The collector tracks total key/value sizes and a `ValueBlockCountTracker` to know when to finish the current writer and start a new one. **Memory savings:** | | Previous (compaction) | Streaming | |---|---|---| | Entry storage | `entries` + `last_entries` (up to 2× full entry vecs) | ~800 pending key entries (~80KB) | | Value data | All values held in memory until written | Written immediately (medium) or after 8KB (small) | | Peak for 256MB SST | Hundreds of MB | ~200KB per writer | **Performance** Both compaction and write benchmarks are within the noise. It makes sense that this might be slightly slower due to some additional per-entry work, but writing is dominated by other factors `write` syscalls, compression and memcopies into temporary buffers. **Files changed:** - `turbopack/crates/turbo-persistence/src/static_sorted_file_builder.rs` — `StreamingSstWriter` with incremental key block flushing via `try_flush_key_block()`, `PendingKeyEntry`, `ValueRef`; generic `IndexBlockBuilder<W: Write>`; `write_static_stored_file` as wrapper; AMQF `shrink_to_fit()` on close - `turbopack/crates/turbo-persistence/src/db.rs` — Replaced compaction `Collector` with streaming version - `turbopack/crates/turbo-persistence/src/lib.rs` — Export `StreamingSstWriter` - `turbopack/crates/turbo-persistence/src/value_block_count_tracker.rs` — Removed dead `is_half_full()`/`reset_to()` methods **Verification:** - `cargo clippy -p turbo-persistence` — zero warnings - `cargo test -p turbo-persistence` — all 46 tests pass - `cargo fmt -p turbo-persistence -- --check` — clean - Criterion benchmarks show no performance regressions (write + compaction within noise)
Author
Parents
Loading