add moe topk(k>2) gate support (#5881)
Some users need top-k with k > 2 to train MoE models, e.g.
https://huggingface.co/Qwen/Qwen2-57B-A14B/blob/main/config.json. This
PR adds support for top-k (k > 2) gates.
- add top-k (k > 2) support
- add a token-drop policy based on position or probabilities
- add unit tests
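A minimal sketch of the idea, not the PR's actual implementation: top-k (k > 2) expert selection plus a capacity-based drop policy, where overflow assignments for an expert are dropped either by token position (keep earliest tokens) or by gate probability (keep highest-probability assignments). The function name `topk_gate`, the `capacity` parameter, and the `drop_policy` values are illustrative assumptions, written with NumPy for clarity.

```python
# Hedged sketch of top-k (k > 2) gating with a capacity-based token-drop
# policy; names and signatures are illustrative, not DeepSpeed's API.
import numpy as np

def topk_gate(logits, k, capacity, drop_policy="position"):
    """Select top-k experts per token; drop assignments over capacity.

    logits: (num_tokens, num_experts) raw gate scores.
    drop_policy: "position" keeps the earliest tokens for each expert;
                 "probs" keeps the highest-probability assignments.
    Returns (topk_idx, topk_prob, keep_mask), each of shape (num_tokens, k).
    """
    num_tokens, num_experts = logits.shape
    # softmax over the expert dimension
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # top-k expert indices per token, sorted by descending probability
    topk_idx = np.argsort(-probs, axis=1)[:, :k]
    topk_prob = np.take_along_axis(probs, topk_idx, axis=1)

    keep = np.ones((num_tokens, k), dtype=bool)
    for expert in range(num_experts):
        tok, slot = np.nonzero(topk_idx == expert)  # assignments to this expert
        if len(tok) <= capacity:
            continue
        if drop_policy == "position":
            order = np.argsort(tok, kind="stable")  # earliest tokens first
        else:  # "probs": keep highest-probability assignments
            order = np.argsort(-topk_prob[tok, slot], kind="stable")
        for i in order[capacity:]:  # drop everything past the capacity
            keep[tok[i], slot[i]] = False
    return topk_idx, topk_prob, keep
```

With `k=3` and a small per-expert capacity, each expert ends up with at most `capacity` kept assignments regardless of the drop policy chosen.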
---------
Co-authored-by: Kurt Chen <kurt.chen@intel.com>
Co-authored-by: Jin, Youzhi <youzhi.jin@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>