Make MHA Query Scaling Behaviors Consistent (#119323)
The multi-head attention (MHA) query scaling behavior is inconsistent across [`need_weights`](https://github.com/pytorch/pytorch/blob/8ac9b20d4b090c213799e81acf48a55ea8d437d6/torch/nn/modules/activation.py#L1073) values.
On the current main, when `need_weights = True`, the query scaling is performed using a [division](https://github.com/pytorch/pytorch/blob/8ac9b20d4b090c213799e81acf48a55ea8d437d6/torch/nn/functional.py#L5434), which is exported as a `Div` operator in ONNX. When `need_weights = False`, the query scaling is performed using a [multiplication](https://github.com/pytorch/pytorch/blob/422b4271aeb0d5998fd439711abf881ac8788478/aten/src/ATen/native/transformers/attention.cpp#L711), which is exported as a `Mul` operator in ONNX, as defined in the [PyTorch ONNX symbolics](https://github.com/pytorch/pytorch/blob/422b4271aeb0d5998fd439711abf881ac8788478/torch/onnx/symbolic_opset14.py#L177).
We should make the query scaling behaviors consistent. On most platforms, multiplication performs no worse than division, so we should use multiplication consistently for both `need_weights = True` and `need_weights = False`.
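As a minimal sketch of why the two styles are interchangeable (toy tensors and shapes are illustrative, not taken from the PR), dividing the query by `sqrt(head_dim)` and multiplying it by `1 / sqrt(head_dim)` produce numerically equivalent results up to floating-point rounding:

```python
import math
import torch

torch.manual_seed(0)

# Toy query tensor: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 4, 8, 16)
scale = math.sqrt(q.size(-1))

# The two scaling styles this PR unifies:
q_div = q / scale          # traced as a Div node in ONNX export
q_mul = q * (1.0 / scale)  # traced as a Mul node in ONNX export

# Mathematically identical; only tiny rounding differences remain.
print(torch.allclose(q_div, q_mul, atol=1e-6))
```

Using the multiplication form everywhere means both code paths export the same ONNX graph shape, and backends that fuse scalar multiplies can treat them uniformly.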
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119323
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD