Add fused layer norm impl on CUDA in PyTorch (#27634)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27634
Add fused layer norm impl on CUDA in PyTorch
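For reference, the computation that a layer norm performs on each row can be sketched in plain Python. This is only an illustration of the semantics, not the CUDA kernel added in this PR. The function name `layer_norm_reference` and its parameters are hypothetical, and the default `eps` of 1e-5 is an assumption.

```python
import math

def layer_norm_reference(rows, gamma=None, beta=None, eps=1e-5):
    """Normalize each row to zero mean and unit variance, then apply an optional affine.

    Hypothetical reference helper; a fused kernel would compute the same
    quantities (mean, reciprocal std, output) without separate passes per op.
    """
    out = []
    for row in rows:
        n = len(row)
        mean = sum(row) / n
        # Biased variance (divide by n, not n - 1).
        var = sum((v - mean) ** 2 for v in row) / n
        rstd = 1.0 / math.sqrt(var + eps)
        normed = [(v - mean) * rstd for v in row]
        if gamma is not None:
            normed = [v * g for v, g in zip(normed, gamma)]
        if beta is not None:
            normed = [v + b for v, b in zip(normed, beta)]
        out.append(normed)
    return out

y = layer_norm_reference([[1.0, 2.0, 3.0]])
```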
Performance benchmark compared with apex.FusedLayerNorm on a V100 machine:
**************************************
Shape = (128, 2097152)
curr LayerNorm forward: 7.252584544941783ms
apex LayerNorm forward: 10.366813436849043ms
curr LayerNorm backward: 15.568048988003284ms
apex LayerNorm backward: 20.869979876093566ms
**************************************
Shape = (256, 1048576)
curr LayerNorm forward: 5.185673736967146ms
apex LayerNorm forward: 6.3868385690730065ms
curr LayerNorm backward: 13.942008479032665ms
apex LayerNorm backward: 15.469660016940907ms
**************************************
Shape = (512, 524288)
curr LayerNorm forward: 4.672068868065253ms
apex LayerNorm forward: 4.717993081081659ms
curr LayerNorm backward: 13.46354596503079ms
apex LayerNorm backward: 14.04774487693794ms
**************************************
Shape = (1024, 262144)
curr LayerNorm forward: 4.547273400006816ms
apex LayerNorm forward: 5.378365494078025ms
curr LayerNorm backward: 13.425063178874552ms
apex LayerNorm backward: 14.235145597020164ms
**************************************
Shape = (2048, 131072)
curr LayerNorm forward: 4.526399010093883ms
apex LayerNorm forward: 4.775081946980208ms
curr LayerNorm backward: 13.222738380078226ms
apex LayerNorm backward: 13.59594238596037ms
**************************************
Shape = (4096, 65536)
curr LayerNorm forward: 4.28789056581445ms
apex LayerNorm forward: 4.48913648002781ms
curr LayerNorm backward: 13.026655421825126ms
apex LayerNorm backward: 13.57052089786157ms
**************************************
Shape = (8192, 32768)
curr LayerNorm forward: 4.243518367875367ms
apex LayerNorm forward: 4.34588153520599ms
curr LayerNorm backward: 13.140627697808668ms
apex LayerNorm backward: 13.49891544203274ms
**************************************
Shape = (16384, 16384)
curr LayerNorm forward: 4.181216162163764ms
apex LayerNorm forward: 4.268723972840235ms
curr LayerNorm backward: 13.035593512002379ms
apex LayerNorm backward: 13.463351831072941ms
**************************************
Shape = (32768, 8192)
curr LayerNorm forward: 4.097899778978899ms
apex LayerNorm forward: 4.109480210812762ms
curr LayerNorm backward: 13.041268918896094ms
apex LayerNorm backward: 13.586135944118723ms
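The timings above can be turned into speedup ratios (apex time divided by curr time). The sketch below does this for the first and last shapes only, using the reported numbers rounded to four decimal places. The `RESULTS` table and the `speedup` helper are illustrative names, not part of this PR.

```python
# (shape, curr fwd ms, apex fwd ms, curr bwd ms, apex bwd ms),
# copied from the report above and rounded to four decimals.
RESULTS = [
    ((128, 2097152), 7.2526, 10.3668, 15.5680, 20.8700),
    ((32768, 8192), 4.0979, 4.1095, 13.0413, 13.5861),
]

def speedup(curr_ms, apex_ms):
    """Return how many times faster curr is than apex (values > 1 favor curr)."""
    return apex_ms / curr_ms

for shape, curr_fwd, apex_fwd, curr_bwd, apex_bwd in RESULTS:
    print(
        f"{shape}: forward {speedup(curr_fwd, apex_fwd):.2f}x, "
        f"backward {speedup(curr_bwd, apex_bwd):.2f}x"
    )
```

Of these two shapes, the gap is larger at (128, 2097152), while the forward pass at (32768, 8192) is close to parity.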
Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "LayerNorm"
Reviewed By: houseroad
Differential Revision: D17462420
fbshipit-source-id: d4a67d160bb4eff73ffac64af46c56c3845cf211