Add afmoe model (#42168)
* Add AFMoE model support
* Address review feedback for AFMoE implementation
* Add flex attention support to AFMoE model
* Fix expert_bias routing in AFMoE
* Remove test-results directory
* Address PR review feedback for AFMoE model
* fix(afmoe): ensure RMSNorm output dtype matches input dtype
* properly return attn weights
* fix most tests
* cleanup
Remove shared expert if/else as it defaults to 2
Remove `route_norm` as it defaults to `True`.
Make tests smaller and faster
* fix input embeds api
* update rope API, make test smaller; should be good to go
* oops, wrong place to skip unittest
* quality
* update
* fill in rope parameter docstring
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>