Support 3D attention mask in MultiheadAttention. (#31996)
Summary:
Support a 3D attention mask for MultiheadAttention. If `attn_mask` already has a batch dimension (i.e. it is 3D), it is used as-is instead of being unsqueezed to add one. Fixes https://github.com/pytorch/pytorch/issues/30678
Related issues/PRs:
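A minimal usage sketch of the behavior described above, not taken from the PR itself. It assumes the 3D mask has shape `(N * num_heads, L, S)` with the heads of each batch element stored contiguously, and uses an additive float mask; the sizes and variable names are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
L, N, E, H = 4, 2, 8, 2  # sequence length, batch size, embed dim, heads (illustrative)

mha = nn.MultiheadAttention(embed_dim=E, num_heads=H)
x = torch.randn(L, N, E)  # default input layout is (seq_len, batch, embed_dim)

# 2D additive mask of shape (L, S): shared by every batch element and head.
causal_2d = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
out_2d, _ = mha(x, x, x, attn_mask=causal_2d)

# 3D mask of shape (N * num_heads, L, S) carrying the same values.
causal_3d = causal_2d.unsqueeze(0).expand(N * H, L, L)
out_3d, _ = mha(x, x, x, attn_mask=causal_3d)

# A 3D mask can also differ per batch element: hide the last source
# position for batch element 0 only (its heads occupy the first H rows,
# under the layout assumption stated above).
per_batch = torch.zeros(N * H, L, L)
per_batch[:H, :, -1] = float("-inf")
_, w = mha(x, x, x, attn_mask=per_batch)  # w has shape (N, L, S)
```

Under these assumptions, `out_2d` and `out_3d` should agree, and `w[0, :, -1]` should be zero while `w[1, :, -1]` stays positive.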
https://github.com/pytorch/pytorch/pull/25359
https://github.com/pytorch/pytorch/issues/29520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31996
Differential Revision: D19332816
Pulled By: zhangguanheng66
fbshipit-source-id: 3448af4b219607af60e02655affe59997ad212d9