add lm_head and embed_out tensor parallelism (#3962)
* add lm_head and embed_out tensor parallelism
* fix lm_head.weight name issue when loading
* replace all_reduce with inference_all_reduce
* refactor lm_head tensor parallelism
---------
Co-authored-by: Chen, Zhenhuan <zhenhuan.chen@intel.com>