DeepSpeed
Add workflow to run full tests
#7783
Closed

tohtana wants to merge 47 commits into master from tohtana/add_full_test_workflow
tohtana add workflow to run full tests
7ff51dca
tohtana Test with -n 1 to debug parallel execution issues
9b556a5c
tohtana force pushed from f909e3e4 to 9b556a5c 64 days ago
tohtana Merge: set TORCH_CUDA_ARCH_LIST=8.9 and use -n 1 for debugging
96d36748
tohtana Fix bf16 checkpoint optimizer state and muon test
e8275323
tohtana Skip aio tests that hang in CI environment
fa213d08
tohtana Skip more hanging ops tests in CI
9195771f
tohtana Fix ulysses PEFT test to use mpu object instead of global groups
0e52d3a8
tohtana Skip pipeline parallelism tests that timeout in CI
b77f9c80
tohtana Skip CPU adam tests that timeout in CI
29eff21a
tohtana Skip zenflow tests that timeout in CI
e0e1cabb
tohtana Skip pipeline checkpoint tests that timeout in CI
70d90c50
tohtana Skip test_multiple_models.py that timeouts
5af1f37e
tohtana Run tests in parallel with -n 4 instead of sequential
a3a9f8e9
tohtana Skip onebit tests that timeout with pipeline config
02d5ed73
tohtana Skip test_ds_initialize.py tests that timeout
0fd0a10d
tohtana Skip test_zero_leaf_module.py tests that timeout
a6fb0cb1
tohtana Skip test_zero_tensor_fragment.py tests that timeout
35ec1e2f
tohtana Skip test_mup_optimizers.py tests that timeout
78d05807
tohtana Skip test_user_args.py shell quoting edge cases
6077494e
tohtana Skip nvme checkpointing tests (no nvme device in CI)
ad2a74df
tohtana Enable async I/O tests with DS_DISABLE_REUSE_DIST_ENV
bcdfe3d0
tohtana Remove test ignores to validate DS_DISABLE_REUSE_DIST_ENV fix
ee61faab
tohtana Fix: Use /mnt/aio/pytest subdirectory for basetemp
bfa48315
tohtana fix(pipeline): set _running_engine_backward for non-last stage backward
6b8290a4
tohtana Skip GDS tests in CI (no GPUDirect Storage hardware)
3d137c6f
tohtana Install pdsh for launcher tests
30a80d06
tohtana Add pdsh, skip zenflow tests (timeout)
c6e60089
tohtana fix: BF16_Optimizer selection and compatibility issues
dfc78347
tohtana fix: skip empty parameters in gradient reduction
505ffa67
tohtana fix(test): add bf16 model with fp32 grad_accum to supported configs
a02bc6e8
tohtana ci: increase parallel test workers to 8
12a6e95d
tohtana ci: enable zenflow tests
121c7e0a
tohtana ci: skip launcher tests requiring SSH
5daced1c
tohtana Skip zenflow tests due to pre-existing Stage 3 bugs
274d3617
tohtana Skip ZenFlow torch adam test (CUDA/fork incompatibility)
ba296ebb
tohtana Mark manual dist init tests as sequential to avoid port conflicts
c993e842
tohtana Add debug test for RowParallel numerical differences
ded0436d
tohtana Update debug workflow to run testRowParallel with multiple seeds and …
ee5e166a
tohtana Debug: Run testRowParallel and sequential tests with multiple seeds
248cfd16
tohtana Debug: Fix testRowParallel selection and use assert_close for diagnos…
667157fb
tohtana Debug: Also update testColumnParallel to use assert_close
7a5eebd9
tohtana Fix autoTP test numerical tolerance with assert_close
b9a5e995
sdvillal Fix Evoformer compilation (#7760)
06755045
tohtana fix format
a4500e67
tohtana Temp: Add CUTLASS and run only Evoformer tests
e668921d
tohtana Fix: Remove --forked from Evoformer test to avoid CUDA fork issue
f61ed519
tohtana Add CUTLASS support and mark Evoformer test as sequential
dca165bf
tohtana closed this 62 days ago
