DeepSpeed
Introduce Megatron-style parallel state management
#7726
Open


eternalNight force pushed from df74e043 to 383eeb18 40 days ago
eternalNight force pushed from 383eeb18 to fa341163 40 days ago
eternalNight marked this pull request as ready for review 40 days ago
eternalNight requested a review from tjruwase 40 days ago
eternalNight requested a review from tohtana 40 days ago
eternalNight assigned eternalNight 40 days ago
Daydreamer-Li requested a review from loadams 25 days ago
Daydreamer-Li force pushed from 45bfe993 to 39cd316d 25 days ago
Introduce Megatron-style parallel state management
efee3ef7
eternalNight parallel_state: Cleanup dependency on ProcessGroupNCCL.Options
e319923f
feat: add config-based parallel state initialization with validation
684f0964
Add sequence parallel support to refactored parallel state
4b362e48
fix: remove Chinese comment from config example
703c1fe8
fix: use torch.distributed.new_group directly in _create_group
04f56a9a
fix: correct SP parallel group creation logic in parallel_state
845bd817
refactor: simplify _create_group to use deepspeed.comm interface
2ec5c69c
feat: migrate All-to-All groups to parallel_state architecture
33a3ca8f
fix: disable gloo process groups by default
138fbebd
refactor: simplify SP group creation using RankGenerator
2aab5b85
docs: fix config example and SP usage notes
a0975c7e
refactor: remove unused is_torch_min_version function
58f7f51a
refactor: simplify config-based initialization to use top-level fields
ff67ec59
eternalNight tests: Drop test_mpu.py from the PR
11e5a2c1
fix: filter unsupported params in initialize_parallel_state_from_conf…
f7bd2dcd
fix: improve initialize_parallel_state_from_config for nested config …
537a8993
docs: add parallel state management documentation
856ed67c
Fix ParallelState mpu compatibility and refactor unit tests
10847574
refactor: unify sequence-data parallel API naming
e74fe736
Rename parallel_state_deepspeed to parallel_state_wrappers
445b71e7
parallel_state: Take parallelism sizes from existing parameters only
da15a8a0
tests: Test ZeRO + Ulysses SP training using ParallelState
b1e16c43
eternalNight force pushed from 9e9416ad to b1e16c43 3 days ago
eternalNight Merge branch 'master' into eternalNight/unify_process_group_management
5fbbb014
