🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722)
* introduce a tensor parallel test mixin to catch TP-related errors
* Remove test file for tensor parallel functionality
* Refactor dense and MoE test scripts for parallel execution and improved GPU management
- Updated `run_dense_tests.sh` and `run_moe_tests.sh` to support parallel execution of tests using available GPU pairs.
- Changed variable names for clarity, replacing `NUM_GPUS` with `GPUS_PER_TEST`.
- Enhanced output messages to reflect the number of parallel test slots and GPU usage.
- Implemented logic to handle skipped tests and updated result reporting to include skipped counts.
- Removed `TensorParallelTesterMixin` from `CausalLMModelTest` and integrated it into `ModelTesterMixin` for better structure in test classes.
* restore
* add all reduce for ep
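A minimal sketch of the idea, assuming `ep_group` is the expert-parallel process group (names are illustrative, not the repo's exact API):
```python
import torch
import torch.distributed as dist

def combine_expert_outputs(local_out: torch.Tensor, ep_group) -> torch.Tensor:
    # Each EP rank only runs its local experts, so tokens routed to remote
    # experts contribute zeros here; summing across the EP group recovers the
    # full MoE output on every rank.
    dist.all_reduce(local_out, op=dist.ReduceOp.SUM, group=ep_group)
    return local_out
```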
* fix init and bias sharding
* fix finalize weight init
* add full stacktracing
* fix
* add report to run tests
* okay big improvement here
* the only case where the shard index should be used is when we are actually collecting for mergeModuleList
* more fixes
* fix EP forward gpt oss
* add tests that trigger the weight converter or only dynamic loading
* Update test scripts to use new tensor parallel test keyword
- Modified `run_dense_tests.sh` and `run_moe_tests.sh` to change the pytest keyword from "test_tensor_parallel" to "test_tp_" for improved test targeting.
- Cleaned up comments and removed unused code in `test_tensor_parallel_mixin.py` to streamline the testing process and enhance readability.
* cleaning + find_port + remove comments
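The `find_port` helper avoids hard-coded rendezvous ports for the distributed tests; a minimal sketch of the approach (the helper body here is illustrative):
```python
import socket

def find_port() -> int:
    # Bind to port 0 so the OS picks a free port, then release it and hand the
    # number to the test's process group as MASTER_PORT.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```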
* revert some changes
* when you are stupid sometimes you really need a brain :) :) :) :)
* fix TP
* Ok GPT oss is fixed now
* try to fix perms
* test only causal llm
* attempt to fix
* am I a doomer and AI is not that bad?
* fix
* it "passes" but the output is shit
* style my man
* outputs are gonna be gibberish but at least the forward pass "works"
* style
* fix mixtral
* okay shape fixes
* tensor idx is only for grouped gemm / EP
* fix gate_up shard
* fix :)
* revert some EP changes that are breaking other stuff
* style
* fix solar open tp
* trigger test on deepseek v3
* fix glm4_moe tp
* fix glm4 moe lite tensor parallel
* fix longcat and glm4_moe_lite by all reducing gradients of k_rot
* fix ernie4_5_moe
* fix qwen3 by all reduce grads of q_norm
* fix deepseek v3 tp (needs a constant dropout, otherwise ranks use different RNG, plus all_reduce backward for K rotary)
* Rename ReplicatedInTP to ReplicatedWithGradAllReduce and update references in tensor_parallel.py
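The renamed placement keeps weights that stay replicated across TP ranks (q_norm, the K rotary path, ...) numerically consistent during training by summing their gradients over the TP group; a minimal sketch of the autograd trick (illustrative, not the exact implementation in tensor_parallel.py):
```python
import torch
import torch.distributed as dist

class AllReduceGradBackward(torch.autograd.Function):
    """Identity in forward; all-reduces the gradient over the TP group in backward."""

    @staticmethod
    def forward(ctx, x, group):
        ctx.group = group
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.contiguous().clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=ctx.group)
        return grad, None
```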
* fix minimax_m2
* fix deepseek v2 for TP
* fix minimax
* fix qwen3_next for TP
* fix dots1 tp
* fix flex_olmo TP
* fix qwen3 tp dense
* fix exaone4 tp
* fix gemma3 tp
* fix apertus TP
* fix seed_oss tp by setting dropout to 0
* fix gemma3n for TP
* dropout set to 0 for test + gradient slicing depending on fused weights or not
* make fixup + important glm4 fix to avoid assigning the wrong TP plan
* linting
* remove shell scripts
* make the tensor parallel tests trigger the CI
* fix ci
* fix ci
* mark it as ep_plan
* add @require_torch_multi_accelerator
* fix CI
* undo pr merge tensor parallel
* revert core model loading file
* revert modeling_utils file
* small fix in modeling_utils
* Update tensor parallel test configurations to enable tests by default and standardize seed values for reproducibility.
* linting
* Reorganize imports in modeling_utils.py to maintain consistency
* fix qwen3_5_moe tp
* fix glm moe dsa tp
* fix qwen3_5 tp
* Add training_overfit_steps parameter to Gemma3nTextModelTest
* fix 16-byte alignment
* Add WeightConverter for gate_up_proj and down_proj with 16-byte alignment in checkpoint mapping
* Add solar_open mapping with WeightConverter for gate_up_proj and down_proj, ensuring 16-byte alignment
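The fused gate_up_proj / down_proj shards are kept 16-byte aligned when split across ranks; a rough sketch of the alignment arithmetic (helper name and rounding strategy are illustrative, not the actual WeightConverter internals):
```python
import torch

def aligned_shard_size(dim: int, world_size: int, dtype: torch.dtype, align_bytes: int = 16) -> int:
    # Per-rank shard size rounded up so each shard starts on a 16-byte boundary.
    elem_size = torch.empty((), dtype=dtype).element_size()
    elems_per_align = max(1, align_bytes // elem_size)
    shard = -(-dim // world_size)                      # ceil(dim / world_size)
    return -(-shard // elems_per_align) * elems_per_align
```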
* Update hub metadata (#43892)
* update
* reorder
* Add MlaKvAProjParallel layer for MLA attention and update TP plans
- Introduced MlaKvAProjParallel class to handle kv_a_proj_with_mqa in tensor parallelism.
- Updated prepare_module_tp methods to accept model parameter for better integration.
- Adjusted base_model_tp_plan in various configurations to include mla_kv_a_proj.
- Removed redundant all_reduce_backward calls from DeepseekV2 and DeepseekV3 attention implementations.
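Illustrative excerpt of a resulting plan; the exact module patterns live in each model's configuration, and only the `mla_kv_a_proj` key is the new piece mapped to MlaKvAProjParallel:
```python
# Illustrative only: real plans are defined per model config.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.kv_a_proj_with_mqa": "mla_kv_a_proj",
    "layers.*.self_attn.kv_b_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
}
```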
* fix doc
* force 16-byte alignment
* fix slice tensor
* more doc
* better abstraction for zero experts
* linting
* refactor
* redundancy in tests
* simplify
* revert
* fix gemma2
* fix
* make tests work only on CPU
* linting
* skip tests for run_slow
* cleaning
* cleaning
* enhance doc on dynamic weight loading
* pass the config instead of the model for TP
* more doc to tensor parallel for MlaKvAProjParallel
* use -1 instead of self.num_heads so that, when TP is used, the local num_heads size is inferred
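A minimal sketch of why -1 matters (illustrative helper, not a specific model's code):
```python
import torch

def split_heads(x: torch.Tensor, head_dim: int) -> torch.Tensor:
    # With a column-sharded q_proj, the last dim of `x` only holds the local
    # heads (num_heads // tp_size); using -1 infers that count instead of
    # hard-coding config.num_heads.
    batch, seq_len, _ = x.shape
    return x.view(batch, seq_len, -1, head_dim)
```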
* fix modular glm_moe_dsa
* collect all gradient failure tests before stopping at first one
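Sketch of the reporting pattern (illustrative; the mixin's actual helpers differ):
```python
import torch

def assert_grads_match(model, reference_grads, atol=1e-5, rtol=1e-5):
    # Collect every mismatched gradient before failing, so one run reports all
    # offending parameters instead of stopping at the first one.
    failures = []
    for name, param in model.named_parameters():
        if param.grad is None or name not in reference_grads:
            continue
        if not torch.allclose(param.grad, reference_grads[name], atol=atol, rtol=rtol):
            diff = (param.grad - reference_grads[name]).abs().max().item()
            failures.append(f"{name}: max abs diff {diff:.3e}")
    assert not failures, "gradient mismatches:\n" + "\n".join(failures)
```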
* generate more max new tokens for tensor parallel tests as the models are small
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* compare generated tokens for tensor parallel tests
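Comparing token ids (rather than logits) between the TP run and the single-device reference; a minimal sketch:
```python
import torch

def assert_same_generation(tp_ids: torch.Tensor, ref_ids: torch.Tensor) -> None:
    # Greedy decoding is deterministic, so the TP run must reproduce the
    # reference token ids exactly, not just approximately.
    assert torch.equal(tp_ids, ref_ids), (
        f"generated tokens diverge:\nTP : {tp_ids.tolist()}\nref: {ref_ids.tolist()}"
    )
```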
* use config attrs as much as possible
* add TP + quantized tests
* raise an error if the attr does not exist, telling the user to add it to the auto mapping
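Illustrative shape of the check (names are hypothetical):
```python
def get_tp_test_attr(config, name: str):
    # Fail loudly instead of silently skipping, so missing entries get added
    # to the auto mapping used by the tensor parallel tests.
    if not hasattr(config, name):
        raise AttributeError(
            f"`{name}` is missing on {type(config).__name__}; add it to the auto mapping."
        )
    return getattr(config, name)
```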
* update doc
* install torchao for tp + quantization tests
* update doc
* update doc
* update doc
* update doc
* update doc
* update doc
* partially fix tp + quantization generation
* partially fix tp + quantize
* skipping some tp + quantized tests for now
* guard torchao import for test_training_ci
* Update src/transformers/models/longcat_flash/modular_longcat_flash.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* move file
* fix linting
* fix linting
* fix port conflict in test
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>