refactor + robusts tests for Tensor Parallel #42809
begin Moe test tensor parallel
40b3e2b6
create tiny moe model + fix test tensor parallel Moe
05172a98
create tiny moe model + fix test tensor parallel Moe
d75f4b86
Merge branch 'main' into v4.57.1-test_tensor_parallel
06635f77
Merge branch 'main' into v4.57.1-test_tensor_parallel
000c33fd
fix backward pass test in tensor parallel for Dense model (#42811)
5f548ed9
Merge branch 'main' into v5-test_tensor_parallel_moe
48c69f7f
use mixtral instead for testing
87fb140d
fix dtensor and tensor mismatch
95240730
linting
ba79de05
checkout test tensor parallel to be like main
3fed52d7
Merge branch 'main' into fix_dtensor_tensor_moe_mismatch
ad0f203b
Merge branch 'main' into fix_dtensor_tensor_moe_mismatch
d6da5af8
avoid hack and create class instead
12ff9a4b
fix loading ep
b337af76
add moe test
7f19dbfb
now EP inference works again but pass still fails
d677102d
3outeille force pushed from 88989a6c to d677102d 50 days ago
Merge branch 'main' into v5-test_tensor_parallel_moe
dc86437c
linting
0155b0f0
now load from checkpoint. Creating a nn.Parameter for param_value wil…
531561dc
3outeille changed the title from "add tensor parallel test for MoE" to "Fix tensor parallel for MoE" 48 days ago
forward now works (add LocalPackedColwise + dont use EP router)
19bfcef7
for now test in float32
f99a67e8
dont do all_reduce manually for GatherParallel. Convert to dtensor ap…
f88b4909
3outeille changed the title from "Fix tensor parallel for MoE" to "Fix distributed training for MoE" 47 days ago
Remove dtensor dependency in Tensor Parallel (#43157)
b0c5a981
Merge branch 'main' into v5-test_tensor_parallel_moe
f78d420c
3outeille changed the title from "Fix distributed training for MoE" to "Tensor Parallel: API + robusts tests + distributed training CI" 22 days ago
tp works for dense and moe in float32 only
3b5d5469
Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
316c15ca
Merge branch 'main' into v5-test_tensor_parallel_moe
ad2c3d50
fix merge conflicts that broke TP
dc711cf2
Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
8f40cf8f
revert parsing for tp plan
04e1944a
all reduce after experts
9e4da160
compile compatible dist ops
3372a0e1
fix gate_up_proj gradient test by doing splitting that takes into ac…
598a65c0
fix moe backward fp32
6d7c93ff
remove functional.Linear to use nn.Linear in experts (this way we att…
372aab9c
moe work with tied embedding as well
0b495cbb
Merge branch 'main' into v5-test_tensor_parallel_moe
7d126d10
style
3403db89
3outeille changed the title from "Tensor Parallel: API + robusts tests + distributed training CI" to "Tensor Parallel: API cleaning + robusts tests" 10 days ago
3outeille changed the title from "Tensor Parallel: API cleaning + robusts tests" to "refactor + robusts tests for Tensor Parallel" 10 days ago
all tests pass
b670eed4
Merge branch 'main' into v5-test_tensor_parallel_moe
0f8f2e1f
Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
c667ebec
make fix-up
59f4cc73
Merge branch 'main' into v5-test_tensor_parallel_moe
5e117929
typo
0144ea2f
use transformer seed + pytest parametrized
ecd309f9
Moved weight and bias dim mapping to ParallelInterface
ffe76c9d
simplified shard tensor signature
5203750f
sync shard_tensor logic with the one in origin/main
7907d502
add function check to avoid mismatch check during set_param_for_module
d6cb1454
remove disable. I was in an older torch version
33208b3e
Add pytest skip condition for tensor parallel tests requiring PyTorch…
845c269f
linting
4350cfc7
Merge branch 'main' into v5-test_tensor_parallel_moe
b0e2c596
linting
65066dc6
Merge branch 'main' into v5-test_tensor_parallel_moe
f48b2070
fixing remaining modular
ab98ee5c
linting
598c9001
Merge branch 'main' into v5-test_tensor_parallel_moe
52336899
Merge branch 'main' into v5-test_tensor_parallel_moe
43dc42a4
Merge branch 'main' into v5-test_tensor_parallel_moe
7a831798
Refactor get_expected_sharded_shape to be only one call
6f08529e
Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
a14c8518
Remove redundant prepare_module_tp method from TensorParallelLayer su…
6845edc2
Merge branch 'main' into v5-test_tensor_parallel_moe
91a56dbd
3outeille enabled auto-merge (squash) 6 days ago
3outeille merged eaab9f2f into main 6 days ago
3outeille deleted the v5-test_tensor_parallel_moe branch 6 days ago
Assignees
No one assigned