Fix: AsyncCheckpointIO snapshots tensors to avoid race with parameter mutation (#21079)
Summary
- Root cause: Background thread serialized live tensor references; the training
thread mutated tensors after scheduling the async save, leading to mixed-step
checkpoints.
- Fix: Snapshot all tensors on the main thread before submitting the async save
using `apply_to_collection(..., torch.Tensor, lambda t: t.detach().clone())`.
Implementation
- Reproduced the issue in a unit test.
- Clone all tensors in the checkpoint payload on the caller thread to take a
point-in-time snapshot.
- Handles `checkpoint` whether it is passed positionally or as a keyword argument.
- Preserves non-tensor values; handles nested containers.
- Continues to surface background exceptions on teardown.
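The snapshot-before-submit pattern can be sketched without any Lightning or torch dependency. This is a minimal, self-contained illustration only: `FakeTensor` and `snapshot` are hypothetical stand-ins for `torch.Tensor` and the actual `apply_to_collection(..., torch.Tensor, lambda t: t.detach().clone())` call described above.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: a mutable value with clone()."""

    def __init__(self, value):
        self.value = value

    def clone(self):
        return FakeTensor(self.value)


def snapshot(obj):
    # Recursively copy nested containers, cloning tensor-like leaves so the
    # background save serializes a point-in-time snapshot, not live references.
    if isinstance(obj, FakeTensor):
        return obj.clone()
    if isinstance(obj, dict):
        return {k: snapshot(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(snapshot(v) for v in obj)
    return obj  # non-tensor values pass through untouched


# Without the snapshot, the training thread's mutation races the async save:
checkpoint = {"step": 3, "weights": [FakeTensor(1.0)]}
frozen = snapshot(checkpoint)             # taken on the caller thread
checkpoint["weights"][0].value = 2.0      # training thread keeps mutating
assert frozen["weights"][0].value == 1.0  # the snapshot is unaffected
```

The key point is that the clone happens on the caller thread, before the save is handed to the executor, so the background thread only ever sees the frozen copy.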
* chlog
---------
Co-authored-by: Jirka B <j.borovec+github@gmail.com>