deepspeedai/DeepSpeed
Pull Requests
Open
Verify fix of CI hang (#7800)
#7851 opened 2026-02-13 08:13 by tohtana
Fix count_used_parameters_in_backward crash on PyTorch < 2.3 (#7756)
#7849 opened 2026-02-12 20:06 by harshang03
[BUG] Fix gradient norm calculation and dynamic shape blocking in PP+ZeRO1 collective communication
#7847 opened 2026-02-12 06:54 by Thinksky5124
Fix ROCm BF16 conversion intrinsics in inference v2 (#7843)
#7846 opened 2026-02-12 02:49 by tohtana
Fix no-grad grad-fn lookup in ZeRO hook counting on PyTorch 2.3 (#7830)
#7841 opened 2026-02-10 03:38 by tohtana
Throw error when parameter is modified in GatheredParameters
#7832 opened 2026-02-05 17:29 by tohtana
Fix subgroup optimizer metadata inconsistency
#7820 opened 2026-01-27 11:19 by st-bang97
fix: Ensure full gradient reduction for Muon with reduce_scatter
#7808 opened 2026-01-23 07:59 by nathon-lee
Enable shm_comm support for arm
#7800 opened 2026-01-20 17:18 by phalani-paladugu
[Draft] Muon Optimizer Support for ZeRO3
#7798 opened 2026-01-20 03:49 by PKUWZP
Fix bf16 dtype mismatch in ZeRO-3 with zero_quantized_weights
#7792 opened 2026-01-18 05:04 by juyterman1000
Fix Muon optimizer conflict with gradient clipping in ZeRO 1/2
#7776 opened 2026-01-12 11:44 by fy817
Fix: ZenFlow Adam integration for updated PyTorch backward flow (#7759)
#7771 opened 2026-01-11 06:48 by Antlera
Introduce all_reduce_hook to support gradient aggregation across replica groups.
#7764 opened 2026-01-07 03:07 by zhengchenyu
feat: add parameter-level precision control for BF16 training
#7750 opened 2025-12-30 06:40 by nathon-lee
Fix Muon optimizer checkpoint resume with bf16 mode
#7748 opened 2025-12-28 22:21 by yurekami
Introduce Megatron-style parallel state management
#7726 opened 2025-12-15 12:40 by eternalNight
Let allgather and alltoall execute in parallel when both attention and MoE use TP
#7723 opened 2025-12-11 07:51 by taozhiwei
fix: Load tensors registered via register_buffer in the weight file on all devices, not only device 0, when loading weights across multiple devices
#7717 opened 2025-12-08 03:57 by KeeProMise
Continue the loop if no expert is found in a parameter that has "expert" in its name
#7685 opened 2025-11-11 19:28 by LckyLke
Configure workflow for offline unit tests
#7512 opened 2025-08-24 16:22 by porfanid
Add world-size getter in Engine
#7479 opened 2025-08-09 09:01 by WoosungMyung
Add EXAONE 4.0 model support for DeepSpeed inference v2
#7456 opened 2025-07-29 01:48 by notkisk
Create COMMITTERS_RESPONSIBILITY.md
#7300 opened 2025-05-21 14:25 by PKUWZP
HF2UCP: Converting a `pytorch_model.bin` or `.safetensors` checkpoint to UCP
#7212 opened 2025-04-10 10:13 by Schwidola0607
Gather output layout support for column parallel
#7181 opened 2025-03-28 03:18 by inkcherry
[bugfix] update results of state_dict loading, embedding resizing to secondary partitions (hpz)
#7130 opened 2025-03-11 08:54 by cyr0930
[Draft] Add support for seq split in Domino
#7111 opened 2025-03-04 21:19 by duanhx1037
Update Domino for Llama3
#7084 opened 2025-02-26 20:08 by shenzheyu
Fix: pipeline model with MoE causes error when sending grads
#7055 opened 2025-02-19 11:53 by wukong1992