transformers
fda2d735 - feat(trainer): Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals (#41723)

Commit
26 days ago
feat(trainer): Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals (#41723)

* Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals and CUDA streams.
* Fix failing CI tests.
* Update JIT checkpoint code to remove CUDA streams and async checkpointing. Introduce sentinel file to identify incomplete checkpoints. Update trainer arg doc and tests.
* Fix sentinel file save path to checkpoint folder, update checkpoint related envs with HF_ prefix.
* Refactor JIT checkpoint logic: rename methods and variables for clarity, improve SIGTERM handling, and update related tests.
* Remove support for environment variable overrides in `TrainingArguments` and update related documentation.
* Apply style fixes.

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
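
The commit message describes two ideas: a SIGTERM handler that triggers a checkpoint, and a sentinel file that marks a checkpoint as incomplete. The following is a minimal, standard-library-only sketch of that general pattern, not the code added to `transformers` by this commit. The class `SigtermCheckpointer`, the helpers `train` and `is_complete`, and the sentinel file name are all hypothetical names chosen for illustration.

```python
import signal
from pathlib import Path

# Hypothetical sentinel name, chosen for illustration only.
SENTINEL_NAME = "checkpoint-in-progress.marker"


class SigtermCheckpointer:
    """Toy stand-in for a trainer hook; not the transformers implementation."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)
        self.checkpoint_requested = False

    def install(self):
        # Register the handler (must be called from the main thread).
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only record the request; the training loop performs the save.
        self.checkpoint_requested = True

    def save(self, step, state):
        ckpt_dir = self.output_dir / f"checkpoint-{step}"
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        sentinel = ckpt_dir / SENTINEL_NAME
        sentinel.touch()  # present while the checkpoint is incomplete
        (ckpt_dir / "state.txt").write_text(state)
        sentinel.unlink()  # removed only after every file is written
        return ckpt_dir


def is_complete(ckpt_dir):
    """A checkpoint directory counts as complete when no sentinel remains."""
    ckpt_dir = Path(ckpt_dir)
    return ckpt_dir.is_dir() and not (ckpt_dir / SENTINEL_NAME).exists()


def train(checkpointer, max_steps):
    """Run a dummy loop; save and stop at the first step after a SIGTERM request."""
    for step in range(1, max_steps + 1):
        # ... one training step would run here ...
        if checkpointer.checkpoint_requested:
            return checkpointer.save(step, f"step={step}")
    return None
```

In this sketch, a process killed partway through `save` leaves the sentinel behind, so a resume step could call `is_complete` and skip that directory. Keeping the handler down to setting a flag avoids doing file I/O inside a signal handler.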