feat(trainer): Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals #41723
efazal force pushed from a457b744 to 05411f37 69 days ago
stas00 commented on 2025-10-22
efazal force pushed from e093fcd7 to 5b21304d 63 days ago
efazal force pushed from 5b21304d to 16b0a402 43 days ago
efazal force pushed from 16b0a402 to 00a9c2b6 30 days ago
Just-in-time (JIT) asynchronous checkpointing using SIGTERM signals a…
b95a867b
Fix failing ci tests
a1389622
Update JIT checkpoint code to remove CUDA streams and async checkpoin…
44433fb0
Fix sentinel file save path to checkpoint folder, update checkpoint r…
4ab2427d
Refactor JIT checkpoint logic: rename methods and variables for clari…
929d2dcc
Remove support for environment variable overrides in `TrainingArgumen…
1eb7f1f2
efazal force pushed from 434c4b58 to 1eb7f1f2 29 days ago
Merge branch 'main' into feat-jit-checkpointing
6e6f4c26
Apply style fixes
57a9df16
SunMarc approved these changes on 2025-12-03
SunMarc merged fda2d735 into main 26 days ago