Transformer{DecoderLayer} : no batch dim (#70322)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60585
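For context, here is a minimal sketch of the unbatched usage this change is meant to allow, assuming a PyTorch build that includes this PR. The sizes and variable names are illustrative and not taken from the PR itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the PR).
d_model, nhead = 8, 2
tgt_len, src_len = 5, 7

# dropout=0.0 plus eval() keeps the comparison below deterministic.
layer = nn.TransformerDecoderLayer(
    d_model, nhead, dim_feedforward=16, dropout=0.0
).eval()

tgt = torch.randn(tgt_len, d_model)     # (T, E): no batch dimension
memory = torch.randn(src_len, d_model)  # (S, E): no batch dimension

with torch.no_grad():
    out_unbatched = layer(tgt, memory)
    # Same inputs with an explicit batch of size 1; the default
    # batch_first=False layout is (T, N, E), so the batch axis is dim 1.
    out_batched = layer(tgt.unsqueeze(1), memory.unsqueeze(1)).squeeze(1)

# The full nn.Transformer should accept unbatched src/tgt in the same way.
model = nn.Transformer(
    d_model=d_model, nhead=nhead,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=16, dropout=0.0,
).eval()
src = torch.randn(src_len, d_model)     # (S, E)
with torch.no_grad():
    out_model = model(src, tgt)         # expected shape: (T, E)
```

The unbatched output is expected to match the batch-of-one output with the batch axis squeezed out, which makes for a quick sanity check.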
TransformerDecoderLayer Test Timings (takes about 30s)
<details>
```
pytest test/test_modules.py -k _TransformerDeco --durations=10
============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/kshiteej/Pytorch/pytorch_no_batch_mha, configfile: pytest.ini
plugins: hypothesis-6.23.2, repeat-0.9.1
collected 639 items / 591 deselected / 48 selected
test/test_modules.py ss......ss......ss..ssssssssss.................. [100%]
============================================================================================== slowest 10 durations ==============================================================================================
17.13s call test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_TransformerDecoderLayer_cuda_float64
4.13s call test/test_modules.py::TestModuleCPU::test_gradgrad_nn_TransformerDecoderLayer_cpu_float64
1.22s call test/test_modules.py::TestModuleCUDA::test_grad_nn_TransformerDecoderLayer_cuda_float64
0.86s call test/test_modules.py::TestModuleCPU::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cpu_float32
0.73s call test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cuda_float32
0.57s call test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerDecoderLayer_cuda_float32
0.56s call test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_TransformerDecoderLayer_cuda_float64
0.48s call test/test_modules.py::TestModuleCPU::test_grad_nn_TransformerDecoderLayer_cpu_float64
0.41s call test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_TransformerDecoderLayer_cuda_float32
0.40s call test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_TransformerDecoderLayer_cuda_float64
============================================================================================ short test summary info =============================================================================================
========================================================================== 32 passed, 16 skipped, 591 deselected, 3 warnings in 29.62s ===========================================================================
```
</details>
Transformer Test Timings (takes about 1m10s)
<details>
```
pytest test/test_modules.py -k _Transformer_ --durations=10
============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.10.0, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
rootdir: /home/kshiteej/Pytorch/pytorch_no_batch_mha, configfile: pytest.ini
plugins: hypothesis-6.23.2, repeat-0.9.1
collected 639 items / 591 deselected / 48 selected
test/test_modules.py ss......ss......ss..ssssssssss.................. [100%]
============================================================================================== slowest 10 durations ==============================================================================================
46.40s call test/test_modules.py::TestModuleCUDA::test_gradgrad_nn_Transformer_cuda_float64
11.09s call test/test_modules.py::TestModuleCPU::test_gradgrad_nn_Transformer_cpu_float64
2.48s call test/test_modules.py::TestModuleCUDA::test_grad_nn_Transformer_cuda_float64
1.03s call test/test_modules.py::TestModuleCPU::test_grad_nn_Transformer_cpu_float64
0.96s call test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Transformer_cuda_float32
0.87s call test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Transformer_cuda_float32
0.85s call test/test_modules.py::TestModuleCUDA::test_non_contiguous_tensors_nn_Transformer_cuda_float64
0.85s call test/test_modules.py::TestModuleCPU::test_cpu_gpu_parity_nn_Transformer_cpu_float32
0.65s call test/test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_Transformer_cuda_float64
0.47s call test/test_modules.py::TestModuleCUDA::test_multiple_device_transfer_nn_Transformer_cuda_float32
============================================================================================ short test summary info =============================================================================================
===================================================================== 32 passed, 16 skipped, 591 deselected, 3 warnings in 70.19s (0:01:10) ======================================================================
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70322
Reviewed By: cpuhrsch
Differential Revision: D33286285
Pulled By: jbschlosser
fbshipit-source-id: 46e08cf47f37787733a535f683c3fd21f652486d