microsoft/DeepSpeed
Open Pull Requests
Auto-detect CUTLASS for EvoformerAttention
#8000 opened 2026-05-08 18:26 by MaxTretikov
zero3: SDMA allgather via mori (sdma_allgather)
#7999 opened 2026-05-07 13:10 by inkcherry
Optimize singleton MoE collectives
#7997 opened 2026-05-07 02:28 by Tianyi-Franklin-Wang
fix: use subprocess instead of os.system in data_analyzer.py
#7994 opened 2026-05-06 09:35 by orbisai0security
docs: add test directory convention to AGENTS.md
#7993 opened 2026-05-06 06:26 by delock
Add engine.coalesce_grad_reduction() for ZeRO 1/2/3 multi-backward
#7992 opened 2026-05-05 10:28 by roycho96
fix gemma4 num attention head bugs (from #7975)
#7990 opened 2026-05-02 04:27 by delock
Fix eigenvalue monitor logging
#7987 opened 2026-04-28 10:59 by heurry
Add Qwen 3.5 preset to AutoTP
#7978 opened 2026-04-16 12:51 by tohtana
fix gemma4 num attention head bugs
#7975 opened 2026-04-15 05:55 by mingxiang1006
[Blog] Muon Optimizer Support in DeepSpeed
#7962 opened 2026-04-08 07:26 by delock
Fix/warnings stacklevel mvapich runner
#7949 opened 2026-04-02 14:00 by nathon-lee
Refactor/torch autocast encapsulate global state
#7946 opened 2026-04-02 06:06 by nathon-lee
feat(zero2): add CPU offload support for Muon optimizer
#7939 opened 2026-03-31 06:59 by delock
Add AutoEP
#7938 opened 2026-03-31 00:11 by tohtana
Fix ZeRO-3 optimizer initialization validation (#7844)
#7929 opened 2026-03-28 16:20 by amadhan882
[Feature] Enable AutoEP Compatibility with ZeRO-3
#7928 opened 2026-03-28 09:09 by nathon-lee
Add torch_xla TPU support for ZeRO-1/2
#7917 opened 2026-03-21 18:43 by PKUWZP
fix: add setup_context for torch.func compatibility
#7916 opened 2026-03-21 09:22 by roycho96
doc: Remove suggestion to build extensions in parallel
#7899 opened 2026-03-12 15:58 by Flamefire
[Bugfix] Validate fp16.loss_scale is finite in DeepSpeedFP16Config
#7892 opened 2026-03-08 20:00 by s-zx
Fix Stage 0 + Ulysses crash: make bwc_tensor_model_parallel_rank() resilient to MP API absence
#7888 opened 2026-03-06 06:59 by nathon-lee
fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter
#7878 opened 2026-02-27 06:46 by nathon-lee
fix: correct DistributedAttention output shape and pad uneven sequence lengths (#7842)
#7868 opened 2026-02-22 11:00 by harshang03
fix: keep fp32-pinned parameters out of the bf16 cast path in ZeRO-3 (#7747)
#7867 opened 2026-02-22 10:52 by harshang03
Revert "fix: remove premature MPI environment variable check in OpenMPIRunner"
#7864 opened 2026-02-21 01:39 by mikloorbi-sys
Fix global .cuh ignore and enforce tracked CUDA headers
#7858 opened 2026-02-18 04:38 by harshang03
Fix ZeRO legacy grad-hook crash when next_functions is missing
#7857 opened 2026-02-17 22:07 by harshang03
Reject non-finite fp16 loss_scale across config and ZeRO paths
#7856 opened 2026-02-17 18:13 by harshang03
Fix zero/division safety gaps in utility and inference paths
#7855 opened 2026-02-17 18:05 by harshang03