Optimize LayerNorm performance on CPU for both forward and backward (#35750)
Summary:
This PR improves `LayerNorm` performance on CPU for both the forward and backward passes.
Results on Xeon 6248:
1. single-socket inference: **1.14x** improvement
2. single-core inference: **1.77x** improvement
3. single-socket training: **6.27x** improvement
When fine-tuning GPT2 on the WikiText2 dataset on dual socket, time per iteration dropped from **4.69s/it** to **3.16s/it**, a **1.48x** improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35750
Reviewed By: zhangguanheng66
Differential Revision: D20810026
Pulled By: glaringlee
fbshipit-source-id: c5801bd76eb944f2e46c2fe4991d9ad4f40495c3