transformers
eaab9f2f - refactor + robust tests for Tensor Parallel (#42809)

refactor + robust tests for Tensor Parallel (#42809)

* begin MoE tensor parallel test
* create tiny MoE model + fix tensor parallel MoE test
* fix backward pass test in tensor parallel for Dense model (#42811)
* fix
* linting
* use Mixtral instead for testing
* fix DTensor and tensor mismatch
* linting
* check out test tensor parallel to be like main
* avoid hack and create a class instead
* fix loading EP
* add MoE test
* now EP inference works again but the pass still fails
* linting
* now load from checkpoint; creating an nn.Parameter for param_value will not transfer its attributes (especially _is_hf_initialized)
* forward now works (add LocalPackedColwise + don't use the EP router)
* for now, test in float32
* don't do all_reduce manually for GatherParallel; convert to the DTensor approach
* Remove DTensor dependency in Tensor Parallel (#43157)
* dense test is passing
* Refactor tensor parallel implementation by removing unused partition_tensor methods
* keep removing dependencies on DTensor
* rename test file
* Update tensor parallel plans to use "colwise_gather_output" across multiple models (see the tp_plan sketch below)
* Remove unused "gather" references and update tensor parallel plans to "colwise_gather_output" in multiple model configurations
* Refactor tensor parallel plans in Fbgemm and FineGrained quantizers by removing unused configurations and comments related to "gather" operations
* add "split_input" option in RowwiseParallel + replace "rowwise_replicate" with "rowwise_split_input"
* Add PackedColwiseParallel and PackedRowwiseParallel + update configuration plans
* mixing files and some fixes for tp and tp_plan
* clean tensor parallel API
* linting
* linting
* Refactor core model loading and tensor parallel utilities: improved parameter handling in `set_param_for_module`, updated tensor sharding functions, removed deprecated code, and added new utility functions for block size calculations
* code quality
* make fixup
* TP works for dense and MoE in float32 only
* fix merge conflicts that broke TP
* revert parsing for tp plan
* all_reduce after experts
* compile-compatible dist ops
* fix gate_up_proj gradient test by splitting in a way that accounts for the fused weight + all_reduce to get the full gradient before functional.linear (see the fused-weight sketch below)
* fix MoE backward in fp32
* replace functional.linear with nn.Linear in experts (this way we can attach hooks; see the hook sketch below)
* MoE works with tied embeddings as well
* style
* all tests pass
* make fix-up
* typo
* use transformers seed + pytest parametrize (see the test sketch below)
* Moved weight and bias dim mapping to ParallelInterface
* simplified shard_tensor signature
* sync shard_tensor logic with the one in origin/main
* add function check to avoid mismatch check during set_param_for_module
* remove disable; I was on an older torch version
* Add pytest skip condition for tensor parallel tests requiring PyTorch >= 2.9
* linting
* linting
* fix remaining modular
* linting
* Refactor get_expected_sharded_shape to be only one call
* Remove redundant prepare_module_tp method from TensorParallelLayer subclasses

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
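For context on the "colwise_gather_output" / "rowwise_split_input" bullets, here is a minimal sketch of what a per-model tensor-parallel plan could look like after this change. The attribute name `base_model_tp_plan` and the module-name patterns are assumptions for illustration; only the plan strings themselves come from the commit message.

```python
# Hypothetical tensor-parallel plan for a decoder-style model; the attribute
# name and module patterns are illustrative, not taken from the actual PR diff.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
    # New-style entry: shard column-wise, then gather the sharded output on
    # every rank instead of leaving a lazily-redistributed (DTensor-like) output.
    "lm_head": "colwise_gather_output",
    # New-style entry: split the replicated input along the last dim before the
    # row-wise matmul (the "split_input" option added to RowwiseParallel).
    "layers.*.mlp.shared_expert.down_proj": "rowwise_split_input",
}
```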
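The PackedColwiseParallel and gate_up_proj bullets both deal with fused weights. The snippet below is a rough, self-contained illustration of the underlying issue, under the assumption that gate and up projections are stacked along dim 0; the PR's actual sharding code lives in its tensor-parallel utilities and may differ.

```python
import torch

# A fused [gate; up] weight: a naive chunk over dim 0 would give rank 0 only
# gate rows and rank 1 only up rows.  Splitting each half separately keeps the
# gate/up pairing intact on every rank (illustrative shapes, not real ones).
world_size, hidden, intermediate = 2, 8, 12
fused_weight = torch.randn(2 * intermediate, hidden)

gate_w, up_w = fused_weight.chunk(2, dim=0)      # un-fuse first
gate_shards = gate_w.chunk(world_size, dim=0)    # then shard each part
up_shards = up_w.chunk(world_size, dim=0)

rank0_weight = torch.cat([gate_shards[0], up_shards[0]], dim=0)
assert rank0_weight.shape == (2 * intermediate // world_size, hidden)
```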
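On the switch from `functional.linear` to `nn.Linear` in the experts: the commit message only says it is done so hooks can be attached, so the following is a generic illustration of that design choice, not the PR's code. A bare `torch.nn.functional.linear` call has no module to register a hook on, while an `nn.Linear` does.

```python
import torch
from torch import nn

expert = nn.Linear(16, 16, bias=False)

def reduce_partial_output(module, inputs, output):
    # In a real tensor-parallel run this is where something like
    # torch.distributed.all_reduce(output) would combine row-wise partial sums.
    return output

expert.register_forward_hook(reduce_partial_output)
out = expert(torch.randn(2, 16))
```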
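The testing bullets (transformers seed, pytest parametrize, PyTorch >= 2.9 skip) could be combined roughly as in the sketch below; the marker name, test name, and dtype list are assumptions, and only the 2.9 threshold and the use of the library's seeding helper come from the message.

```python
import pytest
import torch
from packaging import version

from transformers import set_seed

# Skip on older torch, per the "requires PyTorch >= 2.9" bullet.
requires_torch_2_9 = pytest.mark.skipif(
    version.parse(torch.__version__) < version.parse("2.9.0"),
    reason="tensor parallel tests require PyTorch >= 2.9",
)

@requires_torch_2_9
@pytest.mark.parametrize("dtype", [torch.float32])
def test_tensor_parallel_forward(dtype):
    set_seed(0)  # deterministic weights/inputs across ranks
    # ... build the tiny dense/MoE model and compare TP vs single-device outputs
```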