refactor + robust tests for Tensor Parallel (#42809)
* begin MoE tensor parallel test
* create tiny MoE model + fix tensor parallel MoE test
* fix tensor parallel MoE test
* fix backward pass test in tensor parallel for Dense model (#42811)
* fix
* linting
* use mixtral instead for testing
* fix dtensor and tensor mismatch
* linting
* checkout test tensor parallel to be like main
* avoid hack and create class instead
* fix loading ep
* add moe test
* now EP inference works again, but the pass still fails
* linting
* now load from checkpoint. Creating an nn.Parameter from param_value will not transfer its attributes (especially _is_hf_initialized); see the sketch after this list
* forward now works (add LocalPackedColwise + don't use the EP router)
* for now test in float32
* don't do all_reduce manually for GatherParallel; convert to the DTensor approach
* Remove dtensor dependency in Tensor Parallel (#43157)
* dense test is passing
* Refactor tensor parallel implementation by removing unused partition_tensor methods
* keep removing dependencies on Dtensor
* rename test file
* Update tensor parallel plans to use "colwise_gather_output" across multiple models
* Remove unused "gather" references and update tensor parallel plans to "colwise_gather_output" in multiple model configurations.
* Refactor tensor parallel plans in Fbgemm and FineGrained quantizers by removing unused configurations and comments related to "gather" operations.
* add a 'split_input' option to RowwiseParallel + replace 'rowwise_replicate' with 'rowwise_split_input' (see the sketch after this list)
* Add PackedColwiseParallel and PackedRowwiseParallel + Update configuration plans
* mixing files and some fix for tp and tp_plan
* clean tensor parallel API
* linting
* linting
* Refactor core model loading and tensor parallel utilities: improve parameter handling in `set_param_for_module`, update the tensor sharding functions, remove deprecated code, and add new utility functions for block size calculations.
* code quality
* make fixup
* tp works for dense and moe in float32 only
* fix merge conflicts that broke TP
* revert parsing for tp plan
* all reduce after experts
* compile-compatible dist ops
* fix gate_up_proj gradient test by splitting in a way that takes into account that the weight is fused + all_reduce to get the full gradient before functional.linear (see the sketch after this list)
* fix moe backward fp32
* remove functional.linear and use nn.Linear in experts (this way we can attach hooks); see the sketch after this list
* moe works with tied embeddings as well
* style
* all tests pass
* make fixup
* typo
* use transformers seed + pytest parametrize
* Moved weight and bias dim mapping to ParallelInterface
* simplified shard_tensor signature
* sync shard_tensor logic with the one in origin/main
* add a function check to avoid the mismatch check during set_param_for_module
* remove disable. I was on an older torch version
* Add pytest skip condition for tensor parallel tests requiring PyTorch >= 2.9 (see the sketch after this list)
* linting
* linting
* fixing remaining modular
* linting
* Refactor get_expected_sharded_shape to be only one call
* Remove redundant prepare_module_tp method from TensorParallelLayer subclasses
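
A minimal sketch of the nn.Parameter pitfall mentioned in the checkpoint-loading commit above: Python attributes set on a tensor (such as the `_is_hf_initialized` flag) live on that specific object and are not carried over when the tensor is re-wrapped in `nn.Parameter`, so they have to be copied explicitly. Variable names here are illustrative, not the repo's own code.

```python
import torch
from torch import nn

param_value = torch.randn(4, 4)
param_value._is_hf_initialized = True   # flag used by the loading logic

# re-wrapping creates a new tensor object; the custom attribute is gone
new_param = nn.Parameter(param_value)
print(getattr(new_param, "_is_hf_initialized", "missing"))   # -> missing

# so the attribute has to be transferred by hand after wrapping
new_param._is_hf_initialized = getattr(param_value, "_is_hf_initialized", False)
```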
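
The 'split_input' behaviour added to RowwiseParallel can be illustrated single-process. The snippet below is only a sketch of the math, not the hook-based implementation: the weight is sharded along in_features, the replicated input is split along its last dimension, each rank computes a partial output, and the partials are summed (the Python `sum` stands in for the all_reduce across tensor-parallel ranks).

```python
import torch
import torch.nn.functional as F

world_size = 2
x = torch.randn(3, 8)            # replicated input, hidden size 8
weight = torch.randn(5, 8)       # full nn.Linear weight: (out_features, in_features)

weight_shards = weight.chunk(world_size, dim=1)   # rowwise: shard along in_features
input_shards = x.chunk(world_size, dim=-1)        # "split_input": split the activations

partials = [F.linear(xi, wi) for xi, wi in zip(input_shards, weight_shards)]
y_tp = sum(partials)             # in the real setup this sum is an all_reduce

torch.testing.assert_close(y_tp, F.linear(x, weight))
```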
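
The gate_up_proj gradient fix above comes from the weight being fused: the gate and up projections are stacked along the output dimension, so a naive contiguous chunk would give one rank only gate rows and another only up rows. A sketch with illustrative shapes (not the exact transformers helper):

```python
import torch

hidden, intermediate, world_size = 8, 6, 2
gate_up_proj = torch.randn(2 * intermediate, hidden)   # [gate; up] stacked on dim 0

gate, up = gate_up_proj.chunk(2, dim=0)
gate_shards = gate.chunk(world_size, dim=0)
up_shards = up.chunk(world_size, dim=0)

# each rank keeps matching gate/up rows so SiLU(gate) * up stays rank-local
rank_shards = [torch.cat([g, u], dim=0) for g, u in zip(gate_shards, up_shards)]

naive_shards = gate_up_proj.chunk(world_size, dim=0)   # wrong: rank 0 gets only gate rows
assert not torch.equal(rank_shards[0], naive_shards[0])
```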
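
The reason given for switching the experts from functional.linear to nn.Linear is that tensor parallel hooks are module hooks. The toy example below (not the repo's own code) shows that a forward hook fires when the nn.Linear module is called but not when the equivalent functional call is used.

```python
import torch
from torch import nn
import torch.nn.functional as F

linear = nn.Linear(4, 4)
calls = []
linear.register_forward_hook(lambda mod, inp, out: calls.append("hook fired"))

x = torch.randn(2, 4)
linear(x)                                 # module call: hook fires
F.linear(x, linear.weight, linear.bias)   # same math, but bypasses the hook
print(calls)                              # ['hook fired'] -- only once
```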
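
The PyTorch >= 2.9 gate for the tensor parallel tests could look roughly like the following; the marker name and the exact helper used in the repo are assumptions here.

```python
import pytest
import torch
from packaging import version

# hypothetical marker name; the repo likely wraps this in its own test utility
requires_torch_2_9 = pytest.mark.skipif(
    version.parse(torch.__version__) < version.parse("2.9.0"),
    reason="tensor parallel tests require PyTorch >= 2.9",
)

@requires_torch_2_9
def test_tensor_parallel_forward():
    ...
```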
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>