Re-apply "[bert/RoBERTa] Optimize LayerNorm with explicit vectorization using Vec256" (#31127)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31127
Original commit changeset: d22448b90843
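For context, the change replaces the scalar inner loops of the CPU LayerNorm kernel with explicit Vec256 vectorization. Below is a minimal sketch of the idea, not the actual kernel from this PR: the function name `layer_norm_row` and its signature are hypothetical, and it assumes ATen's `at::vec256::Vec256<float>` API.
```
// Illustrative only: normalize one row of length n (hypothetical helper,
// not the kernel from this PR). Assumes ATen's Vec256<float>.
#include <ATen/cpu/vec256/vec256.h>
#include <cmath>
#include <cstdint>

void layer_norm_row(const float* x, const float* gamma, const float* beta,
                    float* y, int64_t n, float eps) {
  using Vec = at::vec256::Vec256<float>;
  constexpr int64_t K = Vec::size();  // 8 floats per vector with AVX2
  Vec vsum(0.f), vsq(0.f);
  int64_t i = 0;
  for (; i + K <= n; i += K) {  // vectorized sum and sum-of-squares
    Vec v = Vec::loadu(x + i);
    vsum = vsum + v;
    vsq = vsq + v * v;
  }
  float sum = 0.f, sq = 0.f;
  for (; i < n; ++i) {  // scalar tail
    sum += x[i];
    sq += x[i] * x[i];
  }
  float buf[K];
  vsum.store(buf);  // horizontal reduction of the vector accumulators
  for (int64_t j = 0; j < K; ++j) sum += buf[j];
  vsq.store(buf);
  for (int64_t j = 0; j < K; ++j) sq += buf[j];
  const float mean = sum / n;
  const float rstd = 1.f / std::sqrt(sq / n - mean * mean + eps);
  const Vec vmean(mean), vrstd(rstd);
  for (i = 0; i + K <= n; i += K) {  // vectorized normalize + affine
    Vec v = (Vec::loadu(x + i) - vmean) * vrstd;
    v = v * Vec::loadu(gamma + i) + Vec::loadu(beta + i);
    v.store(y + i);
  }
  for (; i < n; ++i) {  // scalar tail
    y[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
  }
}
```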
On Skylake T6:
Single Core:
(Columns in the profiler tables below: op name, self CPU %, self CPU time, total CPU %, total CPU time, average CPU time per call, CUDA %, CUDA time, average CUDA time per call, number of calls, input shapes. Note that the benchmark generates batch_size=47 for the first case and batch_size=56 for the second; even with the larger batch, the vectorized version is still faster than the original, non-vectorized reference C implementation.)
- Before the PR:
```
native_layer_norm 0.81% 5.884ms 0.81% 5.884ms 122.580us NaN 0.000us 0.000us 48 [[47, 1, 1024], [1024], [1024]]
```
- After the PR:
```
native_layer_norm 0.68% 5.053ms 0.68% 5.053ms 105.272us NaN 0.000us 0.000us 48 [[56, 1, 1024], [1024], [1024]]
```
20 Cores:
- Before the PR:
```
native_layer_norm 1.65% 41.682ms 1.65% 41.682ms 868.365us NaN 0.000us 0.000us 48 [[61, 64, 1024], [1024], [1024]]
```
- After the PR:
```
native_layer_norm 1.34% 33.829ms 1.34% 33.829ms 704.771us NaN 0.000us 0.000us 48 [[61, 64, 1024], [1024], [1024]]
```
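For reference, tables like the ones above come from torch.autograd.profiler. A minimal repro sketch of a comparable single-core run follows; the shapes and call count are taken from the first table, while the no-grad wrapper and the sort key are assumptions, not part of this PR:
```
# Illustrative repro sketch, not the benchmark used in this PR.
import torch

torch.set_num_threads(1)  # single-core case; the second run used 20 cores

ln = torch.nn.LayerNorm(1024)
x = torch.randn(47, 1, 1024)  # batch shape from the first table above

with torch.no_grad(), torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(48):  # 48 calls, matching the call count in the tables
        ln(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total"))
```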
ghstack-source-id: 95420889
Test Plan:
buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"
buck test mode/dev-nosan //caffe2/test:nn -- "test_LayerNorm_1d_no_elementwise_affine_eval"
python run_test.py -i nn -- TestNN.test_LayerNorm_1d_no_elementwise_affine_eval
Differential Revision: D18936428
fbshipit-source-id: 8cae33d35fb338b5ac49b1597c2709152612d6e5