[AutoTP] Make AutoTP work when num_heads not divisible by number of workers (#4011)
* allow a number of heads that is not divisible by the number of ranks
* get num_heads from the model config, which is more robust
* simplify the logic where num_heads itself is sharded
* name tweaks
* make the code more robust when num_attention_heads is not defined in model_config
* support num_key_value_heads < num_attention_heads, as used by Llama 2
* add test for 5 ranks
* change the odd rank count to 3 to avoid the test skip
* add a get_shard_size function (sketch below)
* adjust the sharding mechanism to follow the latest AutoTP
* fix accuracy issue
* fix format
* skip tests with fusedqkv
* remove the skip of fusedqkv tests
* skip fusedqkv tests with an odd number of ranks
* support models with n_heads in model_config
* fix TestInjectionPolicy::test[fp32-t5]
* fix uneven_heads on some fusedqkv types (#12)
* support an odd number of ranks with fusedqkv (sketch below)
* fix formatting and clarify text
* better handle activation sizes that are not divisible by the number of heads
* move tp_shard.py under module_inject
* add get_num_kv_heads to tp_shard.py
* refine according to review comments
* remove old comment
* fix a bug in getting num_kv_heads
* support uneven sharding of the lm_head tensor-parallel layer
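
The sketch below illustrates the uneven-sharding arithmetic these commits describe: when the head count is not divisible by the world size, the first `num_heads % world_size` ranks each take one extra head, and a rank's slice of any head-partitioned dimension is sized proportionally. This is a minimal illustration only, not the actual `tp_shard.py` code; the real `get_shard_size` also special-cases MLP and lm_head tensors, and for GQA models such as Llama 2 the split is driven by the key/value head count so each rank keeps whole KV heads.

```python
# Minimal sketch of uneven head-based sharding (illustration only).

def shard_heads(num_heads: int, world_size: int, rank: int) -> int:
    """Heads assigned to `rank`; the first (num_heads % world_size)
    ranks each take one extra head."""
    base, extra = divmod(num_heads, world_size)
    return base + (1 if rank < extra else 0)

def get_shard_size(total_size: int, num_heads: int, world_size: int, rank: int) -> int:
    """Size of this rank's slice of a head-partitioned dimension.
    `total_size` (e.g. hidden_size) is split proportionally to the
    per-rank head counts, so shards differ when num_heads % world_size != 0."""
    assert total_size % num_heads == 0, "dimension must be a multiple of the head count"
    return (total_size // num_heads) * shard_heads(num_heads, world_size, rank)

# Example: 32 heads on 3 ranks -> 11, 11, 10 heads,
# i.e. shard sizes 1408, 1408, 1280 for hidden_size = 4096.
for rank in range(3):
    print(rank, get_shard_size(4096, num_heads=32, world_size=3, rank=rank))
```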
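
For fused QKV weights the same per-rank head counts have to be applied to the Q, K and V slabs separately and the slices re-concatenated. The sketch below assumes a layout where Q, K and V are stacked along the output dimension and share one head count; the actual fix handles several fusedqkv layouts, and interleaved or GQA layouts need different slicing.

```python
import torch

def shard_heads(num_heads: int, world_size: int, rank: int) -> int:
    # Same uneven per-rank head count as in the sketch above.
    base, extra = divmod(num_heads, world_size)
    return base + (1 if rank < extra else 0)

def shard_fused_qkv(qkv_weight: torch.Tensor, num_heads: int, world_size: int, rank: int) -> torch.Tensor:
    """Slice a fused QKV weight of shape [3 * num_heads * head_dim, hidden]
    for one rank, assuming Q, K and V are stacked (not interleaved) on dim 0."""
    out_dim = qkv_weight.shape[0]
    assert out_dim % (3 * num_heads) == 0
    head_dim = out_dim // (3 * num_heads)
    q, k, v = torch.split(qkv_weight, out_dim // 3, dim=0)

    # This rank's head range starts after the heads owned by lower ranks.
    start = sum(shard_heads(num_heads, world_size, r) for r in range(rank))
    count = shard_heads(num_heads, world_size, rank)
    rows = slice(start * head_dim, (start + count) * head_dim)
    return torch.cat([q[rows], k[rows], v[rows]], dim=0)

# Example: hidden = 4096, 32 heads, 3 ranks -> per-rank QKV shards with
# 3*1408, 3*1408 and 3*1280 rows respectively.
```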
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: mzl <mingzhi.liu@intel.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>