Designate device for generate_square_subsequent_mask (#85609)
When the model is on the GPU, generating the mask on the default device (CPU) is quite time-consuming.
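A minimal sketch of the idea, not the exact code from this PR: the helper name `square_subsequent_mask` and its `device` parameter are hypothetical and used only for illustration. It assumes the mask is a float tensor with `-inf` above the diagonal and `0` elsewhere, allocated directly on the target device.

```python
import torch


def square_subsequent_mask(sz, device="cpu"):
    # Hypothetical helper for illustration only.
    # Allocate the (sz, sz) tensor on the requested device, then keep
    # -inf strictly above the diagonal; triu zeroes everything else.
    return torch.triu(
        torch.full((sz, sz), float("-inf"), device=device), diagonal=1
    )


# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mask = square_subsequent_mask(4, device=device)
```

Creating the mask on the model's device this way means it never has to be built on the CPU first.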
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85609
Approved by: https://github.com/albanD