transformers
refactor + robusts tests for Tensor Parallel
#42809
Merged


3outeille merged 65 commits into main from v5-test_tensor_parallel_moe
3outeille
3outeille begin Moe test tensor parallel
40b3e2b6
3outeille create tiny moe model + fix test tensor parallel Moe
05172a98
3outeille create tiny moe model + fix test tensor parallel Moe
d75f4b86
3outeille Merge branch 'main' into v4.57.1-test_tensor_parallel
06635f77
3outeille Merge branch 'main' into v4.57.1-test_tensor_parallel
000c33fd
3outeille fix backward pass test in tensor parallel for Dense model (#42811)
5f548ed9
HuggingFaceDocBuilderDev
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
48c69f7f
3outeille use mixtral instead for testing
87fb140d
3outeille fix dtensor and tensor mismatch
95240730
3outeille linting
ba79de05
3outeille checkout test tensor parallel to be like main
3fed52d7
3outeille Merge branch 'main' into fix_dtensor_tensor_moe_mismatch
ad0f203b
3outeille Merge branch 'main' into fix_dtensor_tensor_moe_mismatch
d6da5af8
3outeille avoid hack and create class instead
12ff9a4b
3outeille fix loading ep
b337af76
3outeille add moe test
7f19dbfb
ArthurZucker commented on 2025-12-17
3outeille now EP inference works again but pass still fails
d677102d
3outeille force pushed from 88989a6c to d677102d 50 days ago
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
dc86437c
3outeille linting
0155b0f0
3outeille now load from checkpoint. Creating a nn.Parameter for param_value wil…
531561dc
3outeille changed the title from "add tensor parallel test for MoE" to "Fix tensor parallel for MoE" 48 days ago
3outeille forward now works (add LocalPackedColwise + dont use EP router)
19bfcef7
3outeille for now test in float32
f99a67e8
3outeille commented on 2025-12-19
3outeille commented on 2025-12-19
3outeille dont do all_reduce manually for GatherParellel. Convert to dtensor ap…
f88b4909
3outeille changed the title from "Fix tensor parallel for MoE" to "Fix distributed training for MoE" 47 days ago
3outeille
ArthurZucker approved these changes on 2026-01-06
ArthurZucker commented on 2026-01-06
3outeille Remove dtensor dependency in Tensor Parallel (#43157)
b0c5a981
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
f78d420c
3outeille changed the title from "Fix distributed training for MoE" to "Tensor Parallel: API + robusts tests + distributed training CI" 22 days ago
3outeille tp works for dense and moe in float32 only
3b5d5469
3outeille Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
316c15ca
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
ad2c3d50
3outeille fix merge conflicts that broke TP
dc711cf2
3outeille Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
8f40cf8f
3outeille revert parsing for tp plan
04e1944a
3outeille all reduce after experts
9e4da160
3outeille compile compatible dist ops
3372a0e1
3outeille fix gate_up_proj gradient test by doing splitting that takes into ac…
598a65c0
3outeille fix moe backward fp32
6d7c93ff
3outeille remove functional.Linear to use nn.Linear in experts (this way we att…
372aab9c
github-actions
3outeille moe work with tied embedding as well
0b495cbb
ArthurZucker Merge branch 'main' into v5-test_tensor_parallel_moe
7d126d10
ArthurZucker style
3403db89
3outeille changed the title from "Tensor Parallel: API + robusts tests + distributed training CI" to "Tensor Parallel: API cleaning + robusts tests" 10 days ago
3outeille changed the title from "Tensor Parallel: API cleaning + robusts tests" to "refactor + robusts tests for Tensor Parallel" 10 days ago
3outeille all tests pass
b670eed4
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
0f8f2e1f
3outeille Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
c667ebec
3outeille make fix-up
59f4cc73
3outeille requested a review from ArthurZucker 10 days ago
3outeille commented on 2026-01-26
ArthurZucker commented on 2026-01-27
Cyrilvallez commented on 2026-01-27
vasqu
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
5e117929
3outeille typo
0144ea2f
3outeille use transformer seed + pytest parametrized
ecd309f9
3outeille Moved weight and bias dim mapping to ParallelInterface
ffe76c9d
3outeille simplified shard tensor signature
5203750f
3outeille sync shard_tensor logic with the one in origin/main
7907d502
3outeille add function check to avoid mismatch check during set_param_for_module
d6cb1454
3outeille
github-actions
3outeille remove disable. I was in an older torch version
33208b3e
3outeille Add pytest skip condition for tensor parallel tests requiring PyTorch…
845c269f
3outeille linting
4350cfc7
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
b0e2c596
3outeille linting
65066dc6
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
f48b2070
3outeille requested a review from ArthurZucker 7 days ago
3outeille fixing remaining modular
ab98ee5c
3outeille linting
598c9001
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
52336899
3outeille
github-actions
github-actions
ArthurZucker approved these changes on 2026-01-29
Cyrilvallez approved these changes on 2026-01-30
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
43dc42a4
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
7a831798
3outeille
3outeille Refactor get_expected_sharded_shape to be only one call
6f08529e
3outeille Merge branch 'v5-test_tensor_parallel_moe' of https://github.com/hugg…
a14c8518
3outeille Remove redundant prepare_module_tp method from TensorParallelLayer su…
6845edc2
github-actions
3outeille Merge branch 'main' into v5-test_tensor_parallel_moe
91a56dbd
3outeille enabled auto-merge (squash) 6 days ago
3outeille merged eaab9f2f into main 6 days ago
3outeille deleted the v5-test_tensor_parallel_moe branch 6 days ago
