Add workflow to run full tests #7783
add workflow to run full tests
7ff51dca
Test with -n 1 to debug parallel execution issues
9b556a5c
tohtana force pushed from f909e3e4 to 9b556a5c 64 days ago
Merge: set TORCH_CUDA_ARCH_LIST=8.9 and use -n 1 for debugging
96d36748
Fix bf16 checkpoint optimizer state and muon test
e8275323
Skip aio tests that hang in CI environment
fa213d08
Skip more hanging ops tests in CI
9195771f
Fix ulysses PEFT test to use mpu object instead of global groups
0e52d3a8
Skip pipeline parallelism tests that timeout in CI
b77f9c80
Skip CPU adam tests that timeout in CI
29eff21a
Skip zenflow tests that timeout in CI
e0e1cabb
Skip pipeline checkpoint tests that timeout in CI
70d90c50
Skip test_multiple_models.py, which times out
5af1f37e
Run tests in parallel with -n 4 instead of sequential
a3a9f8e9
Skip onebit tests that timeout with pipeline config
02d5ed73
Skip test_ds_initialize.py tests that timeout
0fd0a10d
Skip test_zero_leaf_module.py tests that timeout
a6fb0cb1
Skip test_zero_tensor_fragment.py tests that timeout
35ec1e2f
Skip test_mup_optimizers.py tests that timeout
78d05807
Skip test_user_args.py shell quoting edge cases
6077494e
Skip nvme checkpointing tests (no nvme device in CI)
ad2a74df
Enable async I/O tests with DS_DISABLE_REUSE_DIST_ENV
bcdfe3d0
Remove test ignores to validate DS_DISABLE_REUSE_DIST_ENV fix
ee61faab
Fix: Use /mnt/aio/pytest subdirectory for basetemp
bfa48315
fix(pipeline): set _running_engine_backward for non-last stage backward
6b8290a4
Skip GDS tests in CI (no GPUDirect Storage hardware)
3d137c6f
Install pdsh for launcher tests
30a80d06
Add pdsh, skip zenflow tests (timeout)
c6e60089
fix: BF16_Optimizer selection and compatibility issues
dfc78347
fix: skip empty parameters in gradient reduction
505ffa67
fix(test): add bf16 model with fp32 grad_accum to supported configs
a02bc6e8
ci: increase parallel test workers to 8
12a6e95d
ci: enable zenflow tests
121c7e0a
ci: skip launcher tests requiring SSH
5daced1c
Skip zenflow tests due to pre-existing Stage 3 bugs
274d3617
Skip ZenFlow torch adam test (CUDA/fork incompatibility)
ba296ebb
Mark manual dist init tests as sequential to avoid port conflicts
c993e842
Add debug test for RowParallel numerical differences
ded0436d
Update debug workflow to run testRowParallel with multiple seeds and …
ee5e166a
Debug: Run testRowParallel and sequential tests with multiple seeds
248cfd16
Debug: Fix testRowParallel selection and use assert_close for diagnos…
667157fb
Debug: Also update testColumnParallel to use assert_close
7a5eebd9
Fix autoTP test numerical tolerance with assert_close
b9a5e995
Fix Evoformer compilation (#7760)
06755045
fix format
a4500e67
Temp: Add CUTLASS and run only Evoformer tests
e668921d
Fix: Remove --forked from Evoformer test to avoid CUDA fork issue
f61ed519
Add CUTLASS support and mark Evoformer test as sequential
dca165bf