turbo-persistence: stop background persisting after unrecoverable failure (#92106)
### What?
When a persist or compaction operation fails in `turbo-persistence`, the
database now:
- Rolls back cleanly (deletes orphan files, restores CURRENT)
- Stops the background persisting process for the session
- Keeps in-memory state consistent with on-disk state at all times
- Deletes superseded files safely (with Windows fallback for open memory
maps)
### Why?
Previously, a failed write operation (e.g. disk full, I/O error) would
leave the database in a broken state:
1. **Misleading error loop** — The `active_write_operation` `AtomicBool`
was left set to `true` after failure, so every subsequent snapshot cycle
printed _"another write operation is already in progress"_ forever,
hiding the real error.
2. **In-memory corruption** — `commit()` mutated `inner.meta_files` and
`inner.current_sequence_number` *before* writing the CURRENT file to
disk. If a disk error occurred between those two steps, the in-memory
state was inconsistent with disk and the rollback had no way to fix it.
3. **Rollback could corrupt committed data** — If `commit()` failed
*after* writing CURRENT (e.g. during old-file deletion or LOG writing),
the rollback would delete the *newly committed* files, corrupting the
database.
4. **Task graph corruption** — `save_snapshot` consumes task cache log
entries. If it failed, those entries were lost, but the background loop
would continue trying to persist — silently skipping those tasks and
corrupting the task graph in storage.
5. **Partially written CURRENT** — If the failure happened mid-write to
the CURRENT file, it could be left with partial/corrupt content, but
nothing restored it.
### How?
**`WriteOperationGuard` RAII (db.rs)**
A new `WriteOperationGuard<'a>` replaces the `AtomicBool` + manual
`try_recover_after_failed_write()` pattern. The guard holds:
- `&'a Mutex<Option<ActiveWriteState>>` — the write slot (`None` = idle,
`Some(Active("write batch"))` = in progress, `Some(Error)` = permanently
disabled)
- `path: &'a Path` — database directory for rollback
- `seq_before: u32` — sequence number at operation start
- `succeeded: bool` — set by `guard.success()`
On `drop`, if not succeeded:
1. Writes `seq_before` back to CURRENT (repairs a partially-written
CURRENT)
2. Deletes all files with `seq > seq_before` (orphans from the failed
operation)
3. Sets the slot to `None` (cleanup succeeded) or `Some(Error)` (cleanup
itself failed)
The `Active` variant carries a `&'static str` name (e.g. `"write
batch"`, `"compaction"`) used in error messages.
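
The guard behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in db.rs: the `acquire` constructor, the `rollback` helper, the `<seq>.<ext>` file naming, and CURRENT being stored as decimal text are all assumptions made to keep the example self-contained.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::Mutex;

/// State of the single write slot; `None` inside the mutex means idle.
#[derive(Debug, PartialEq)]
pub enum ActiveWriteState {
    /// An operation is running; the name is used in error messages.
    Active(&'static str),
    /// Cleanup failed; writes stay disabled for the session.
    Error,
}

pub struct WriteOperationGuard<'a> {
    slot: &'a Mutex<Option<ActiveWriteState>>,
    path: &'a Path,
    seq_before: u32,
    succeeded: bool,
}

impl<'a> WriteOperationGuard<'a> {
    /// Hypothetical constructor: claims the slot or explains why it cannot.
    pub fn acquire(
        slot: &'a Mutex<Option<ActiveWriteState>>,
        path: &'a Path,
        seq_before: u32,
        name: &'static str,
    ) -> Result<Self, String> {
        let mut state = slot.lock().unwrap();
        if let Some(existing) = &*state {
            return Err(match existing {
                ActiveWriteState::Active(other) => format!("{other} is already in progress"),
                ActiveWriteState::Error => "writes are disabled after an earlier failure".to_string(),
            });
        }
        *state = Some(ActiveWriteState::Active(name));
        Ok(Self { slot, path, seq_before, succeeded: false })
    }

    /// Marks the operation as committed so that `drop` skips the rollback.
    pub fn success(&mut self) {
        self.succeeded = true;
    }

    /// Assumed rollback: rewrite CURRENT, then delete files newer than `seq_before`.
    fn rollback(&self) -> io::Result<()> {
        fs::write(self.path.join("CURRENT"), self.seq_before.to_string())?;
        for entry in fs::read_dir(self.path)? {
            let file = entry?.path();
            let seq = file
                .file_stem()
                .and_then(|stem| stem.to_str())
                .and_then(|stem| stem.parse::<u32>().ok());
            if matches!(seq, Some(s) if s > self.seq_before) {
                fs::remove_file(&file)?;
            }
        }
        Ok(())
    }
}

impl Drop for WriteOperationGuard<'_> {
    fn drop(&mut self) {
        // Free the slot on success or clean rollback; otherwise disable writes.
        let next = if self.succeeded || self.rollback().is_ok() {
            None
        } else {
            Some(ActiveWriteState::Error)
        };
        *self.slot.lock().unwrap() = next;
    }
}
```

Because the cleanup lives in `Drop`, every early return via `?` inside a write operation triggers the rollback without any explicit recovery call.
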
**Three-phase `commit()` (db.rs)**
`commit()` is restructured so `inner` is completely unmodified before
the point of no return:
| Phase | What happens | `inner` state | On failure |
|-------|-------------|---------------|------------|
| **A** | Compute `meta_seq_numbers_to_delete` via `sst_filter`. Uses `apply_filter_collect` (read-only) to update filter state and collect per-meta-file removal sets without modifying any MetaFile. Only a read lock on `inner` is needed. | Unchanged | Guard deletes orphan files + restores CURRENT; `inner` is intact |
| **B** | Write `.del` file and CURRENT to disk. | Unchanged | Same as above |
| **C** | Apply deferred `retain_entries` (from A's removal sets), append new metas, remove obsolete metas, bump `current_sequence_number`. Try to delete superseded files; defer failures. | Updated | CURRENT is already durable; commit is irreversible |
After CURRENT is written (point of no return), LOG writing errors are
caught and reported via `eprintln!` — they must not propagate because
the `WriteOperationGuard` would then run its rollback and delete the
*newly committed* files.
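
The phase ordering can be illustrated with a simplified sketch. Everything here is hypothetical scaffolding rather than the real `commit()`: `Inner` is reduced to two fields, the `.del` file is omitted, and the `write_current` / `write_log` closures stand in for the actual disk writes so that failures can be simulated.

```rust
use std::io;

/// Simplified stand-in for the database's in-memory state.
#[derive(Debug, Clone, PartialEq)]
pub struct Inner {
    pub meta_files: Vec<u32>,
    pub current_sequence_number: u32,
}

/// Hypothetical three-phase commit over the simplified `Inner`.
pub fn commit(
    inner: &mut Inner,
    new_metas: &[u32],
    obsolete: &[u32],
    write_current: impl FnOnce(u32) -> io::Result<()>,
    write_log: impl FnOnce(u32) -> io::Result<()>,
) -> io::Result<()> {
    // Phase A: compute the outcome from a read-only view of `inner`.
    let new_seq = new_metas
        .iter()
        .copied()
        .fold(inner.current_sequence_number, u32::max);
    let to_remove: Vec<u32> = inner
        .meta_files
        .iter()
        .copied()
        .filter(|seq| obsolete.contains(seq))
        .collect();

    // Phase B: durable write. An error returns here with `inner` untouched.
    write_current(new_seq)?;

    // Phase C: past the point of no return, apply the in-memory changes.
    inner.meta_files.retain(|seq| !to_remove.contains(seq));
    inner.meta_files.extend_from_slice(new_metas);
    inner.current_sequence_number = new_seq;

    // LOG errors are reported but not propagated, so no rollback is triggered.
    if let Err(err) = write_log(new_seq) {
        eprintln!("failed to write LOG after commit: {err}");
    }
    Ok(())
}
```

In this shape, a failure in Phase B is the last place an error can escape, which is what lets a rollback guard assume that an `Err` always means "CURRENT was not advanced".
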
**`SstFilter::apply_filter_collect` (sst_filter.rs)**
A new variant of `apply_filter` that is read-only with respect to the
MetaFile: it updates the filter state and returns a `FxHashSet<u32>` of
SST entry sequence numbers to remove from each meta file, without
calling `retain_entries`. The original `apply_filter` (which mutates the
MetaFile) is still used by `load_directory` and during new-meta-file
construction where immediate mutation is appropriate.
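
The difference between the two variants might look like the sketch below. The filter semantics are an assumption made for illustration (each meta file lists the SSTs it supersedes, and meta files are visited newest first), the struct fields are hypothetical, and `std::collections::HashSet` stands in for `FxHashSet` to avoid an external crate.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified meta file.
pub struct MetaFile {
    /// Sequence numbers of the SST entries this meta file references.
    pub entries: Vec<u32>,
    /// SSTs that this meta file declares obsolete.
    pub obsolete_ssts: Vec<u32>,
}

#[derive(Default)]
pub struct SstFilter {
    obsolete: HashSet<u32>,
}

impl SstFilter {
    /// Mutating variant: drops obsolete entries from the meta file immediately.
    pub fn apply_filter(&mut self, meta: &mut MetaFile) {
        let removed = self.apply_filter_collect(meta);
        meta.entries.retain(|seq| !removed.contains(seq));
    }

    /// Collecting variant: takes `&MetaFile`, so the meta file cannot change.
    /// Only the filter's own state is updated.
    pub fn apply_filter_collect(&mut self, meta: &MetaFile) -> HashSet<u32> {
        let removed: HashSet<u32> = meta
            .entries
            .iter()
            .copied()
            .filter(|seq| self.obsolete.contains(seq))
            .collect();
        self.obsolete.extend(meta.obsolete_ssts.iter().copied());
        removed
    }
}
```

Taking `&MetaFile` instead of `&mut MetaFile` lets the compiler enforce that the collecting pass leaves meta files untouched; the caller applies the returned set later, once the commit can no longer fail.
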
**Deferred file deletion (db.rs)**
Superseded `.sst`/`.meta`/`.blob` files are deleted at the end of
Phase C (once `inner` is updated). On Linux/macOS, unlinking succeeds
even if concurrent readers have the files memory-mapped. On Windows,
open memory maps prevent deletion, so any file that fails is stored as a
`DeferredDeletion` enum (`Sst(u32)` / `Meta(u32)` / `Blob(u32)`) and
retried on the next commit or at shutdown. The `.del` file written
during Phase B ensures crash recovery via `load_directory` regardless.
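
A delete-or-defer pass could be sketched as follows. The `DeferredDeletion` variants come from the description above; the `file_path` helper, the zero-padded file naming, and the `delete_or_defer` function are assumptions for illustration only.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeferredDeletion {
    Sst(u32),
    Meta(u32),
    Blob(u32),
}

impl DeferredDeletion {
    /// Assumed naming scheme: zero-padded sequence number plus extension.
    fn file_path(self, dir: &Path) -> PathBuf {
        let (seq, ext) = match self {
            DeferredDeletion::Sst(seq) => (seq, "sst"),
            DeferredDeletion::Meta(seq) => (seq, "meta"),
            DeferredDeletion::Blob(seq) => (seq, "blob"),
        };
        dir.join(format!("{seq:08}.{ext}"))
    }
}

/// Tries every deletion and returns the ones that must be retried later.
pub fn delete_or_defer(dir: &Path, pending: Vec<DeferredDeletion>) -> Vec<DeferredDeletion> {
    pending
        .into_iter()
        .filter(|deletion| match fs::remove_file(deletion.file_path(dir)) {
            Ok(()) => false,
            // A missing file needs no retry.
            Err(err) if err.kind() == io::ErrorKind::NotFound => false,
            // Any other failure (for example an open memory map on Windows) is deferred.
            Err(_) => true,
        })
        .collect()
}
```

The returned list would be kept alongside the database state and passed back into the same function on the next commit or at shutdown.
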
**Background loop error handling (backend/mod.rs)**
- `snapshot_and_persist()` returns `Result<(Instant, bool),
anyhow::Error>` instead of `Option`. When `save_snapshot` fails, the
error propagates with `?`.
- The background loop matches on the `Result`: on `Err`, it logs the
error and a message that persisting is disabled for this session, then
returns (permanently stopping the background job).
- `has_unrecoverable_write_error()` checks the `ActiveWriteState::Error`
variant to detect permanent failure after compaction errors.
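
The loop behaviour in the bullets above can be sketched like this. It is an illustrative reduction, not the code in backend/mod.rs: `PersistError` is a boxed error standing in for `anyhow::Error`, the closure parameter replaces the real `snapshot_and_persist()` method, and `max_cycles` exists only so the example terminates.

```rust
use std::time::Instant;

/// Stand-in for `anyhow::Error`, used to keep the sketch dependency-free.
pub type PersistError = Box<dyn std::error::Error + Send + Sync>;

/// Hypothetical background loop. Returns the number of successful cycles
/// completed before it stopped.
pub fn background_loop(
    mut snapshot_and_persist: impl FnMut() -> Result<(Instant, bool), PersistError>,
    max_cycles: usize,
) -> usize {
    let mut completed = 0;
    for _ in 0..max_cycles {
        match snapshot_and_persist() {
            // The tuple contents are not interpreted in this sketch.
            Ok((_instant, _flag)) => completed += 1,
            Err(err) => {
                eprintln!("Persisting failed: {err}");
                eprintln!("Persisting is disabled for this session.");
                // Returning here permanently stops the background job.
                return completed;
            }
        }
    }
    completed
}
```

Returning from the loop on the first `Err` means the real error is logged once, and no later cycle runs against state whose log entries were already consumed.
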
---------
Co-authored-by: Tobias Koppers <sokra@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>