feat(trainer): Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals (#41723)
* Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals and CUDA streams.
* Fix failing CI tests.
* Update JIT checkpoint code to remove CUDA streams and async checkpointing. Introduce a sentinel file to identify incomplete checkpoints. Update trainer argument docs and tests.
* Fix the sentinel file save path so it is written to the checkpoint folder; prefix checkpoint-related environment variables with HF_.
* Refactor JIT checkpoint logic: rename methods and variables for clarity, improve SIGTERM handling, and update related tests.
* Remove support for environment variable overrides in `TrainingArguments` and update related documentation.
* Apply style fixes
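
For reviewers unfamiliar with the pattern, the SIGTERM plus sentinel-file flow described above can be sketched roughly as follows. This is a minimal standalone illustration, not the Trainer implementation: `TerminationFlag`, `save_checkpoint`, `is_complete`, `train`, the sentinel file name, and `state.json` are all hypothetical names chosen for the example.

```python
import json
import os
import signal

# Hypothetical sentinel name; the real file name used by the trainer may differ.
SENTINEL_NAME = "checkpoint-is-incomplete.txt"


class TerminationFlag:
    """Records that SIGTERM arrived so the loop can react at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the saving elsewhere.
        self.requested = True


def save_checkpoint(state, checkpoint_dir):
    """Write the sentinel first, then the state, then remove the sentinel."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    sentinel = os.path.join(checkpoint_dir, SENTINEL_NAME)
    with open(sentinel, "w") as f:
        f.write("in progress\n")
    with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
        json.dump(state, f)
    # If the process is killed before this line, the sentinel stays behind.
    os.remove(sentinel)


def is_complete(checkpoint_dir):
    """A checkpoint folder is usable only if no sentinel file is left in it."""
    sentinel = os.path.join(checkpoint_dir, SENTINEL_NAME)
    return os.path.isdir(checkpoint_dir) and not os.path.exists(sentinel)


def train(num_steps, output_dir, flag):
    """Toy loop: checkpoint just in time once termination was requested."""
    for step in range(1, num_steps + 1):
        # ... one training step would run here ...
        if flag.requested:
            checkpoint_dir = os.path.join(output_dir, f"checkpoint-{step}")
            save_checkpoint({"step": step}, checkpoint_dir)
            return checkpoint_dir
    return None
```

On resume, a caller would skip any checkpoint folder for which `is_complete` returns False, since a leftover sentinel means the save was interrupted.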
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>