pytorch-lightning
2c74bee0 - Fix: AsyncCheckpointIO snapshots tensors to avoid race with parameter mutation (#21079)

Fix: AsyncCheckpointIO snapshots tensors to avoid race with parameter mutation (#21079)

Summary
- Root cause: the background thread serialized live tensor references; the training thread mutated those tensors after the async save was scheduled, producing mixed-step checkpoints.
- Fix: snapshot all tensors on the main thread before submitting the async save, using `apply_to_collection(..., torch.Tensor, lambda t: t.detach().clone())`.

Implementation
- Reproduce the issue in a unit test.
- Clone all tensors in the checkpoint payload on the caller thread to take a point-in-time snapshot.
- Support both positional and keyword `checkpoint` parameters.
- Preserve non-tensor values; handle nested containers.
- Continue to surface background exceptions on teardown.

* chlog

Co-authored-by: Jirka B <j.borovec+github@gmail.com>
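The snapshot-before-submit idea can be sketched in isolation. The following is a minimal, hypothetical illustration (not the Lightning implementation): `FakeTensor` stands in for `torch.Tensor`, and `snapshot` mimics what `apply_to_collection(..., torch.Tensor, lambda t: t.detach().clone())` does — walk nested containers, clone tensor-like leaves on the caller thread, and pass non-tensor values through unchanged, so later mutations by the training loop cannot reach the payload being serialized in the background.

```python
class FakeTensor:
    """Stand-in for torch.Tensor; clone() returns an independent copy."""

    def __init__(self, data):
        self.data = list(data)

    def clone(self):
        return FakeTensor(self.data)


def snapshot(obj):
    # Recurse through dicts/lists/tuples, cloning tensor-like leaves and
    # leaving non-tensor values (ints, strings, ...) untouched. This is the
    # point-in-time copy taken on the main thread before the async save.
    if isinstance(obj, FakeTensor):
        return obj.clone()
    if isinstance(obj, dict):
        return {k: snapshot(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(snapshot(v) for v in obj)
    return obj


checkpoint = {"state_dict": {"w": FakeTensor([1, 2])}, "epoch": 3}
snap = snapshot(checkpoint)

# Simulate the training thread mutating parameters after the save was scheduled:
checkpoint["state_dict"]["w"].data[0] = 999

print(snap["state_dict"]["w"].data)  # snapshot is unaffected: [1, 2]
```

In the real fix the cloned payload, not the live one, is what gets submitted to the background executor; the clone cost is paid synchronously, which is the trade-off for a consistent checkpoint.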