faster generate_square_subsequent_mask in nn.Transformer (#60631)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631
Per #48360, speed up `Transformer.generate_square_subsequent_mask`. In informal benchmarking the new implementation is roughly 5x faster, though the absolute difference is probably small.
The PR updates both the Python and C++ versions, as well as a couple of places where the previous implementation had been copied.
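
For illustration, here is a minimal sketch of the kind of change involved. It assumes the previous implementation built a boolean triangular matrix and then applied two `masked_fill` passes, and that the faster version fills a tensor with `-inf` once and keeps the strict upper triangle. The function names `mask_two_pass` and `mask_single_pass` are hypothetical and are not part of the PyTorch API; see the diff for the actual code.

```python
import torch


def mask_two_pass(sz: int) -> torch.Tensor:
    # Assumed shape of the older approach: build a boolean lower-triangular
    # matrix of allowed positions, then fill blocked positions with -inf
    # and allowed positions with 0.0 in two separate passes.
    allowed = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
    return (
        allowed.float()
        .masked_fill(allowed == 0, float("-inf"))
        .masked_fill(allowed == 1, 0.0)
    )


def mask_single_pass(sz: int) -> torch.Tensor:
    # Assumed shape of the faster approach: fill with -inf once, then keep
    # only the strict upper triangle. torch.triu zeroes everything on and
    # below the main diagonal when diagonal=1.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


if __name__ == "__main__":
    # Both constructions should produce the same causal mask.
    print(torch.equal(mask_two_pass(4), mask_single_pass(4)))
    print(mask_single_pass(3))
```

Both functions return a float mask with `0.0` on and below the diagonal and `-inf` above it, so position `i` can attend only to positions `<= i`.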
Test Plan: Imported from OSS
Reviewed By: jbschlosser, albanD
Differential Revision: D29356673
Pulled By: bhosmer
fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e