configurable pre/post LayerNorm in nn.Transformer (#60593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60593
Per #55270, this PR makes it configurable whether to run LayerNorm before or after other operations in Transformer layers.
However, it leaves for a separate PR the removal of the LayerNorm performed after the final encoder/decoder layer has run, which is redundant when LayerNorm has already been applied after the other in-layer operations (the problem described in #24930, #50086, and #51447).
Note: this means that transformers built with `nn.Transformer()` are now configurable, but will still contain a redundant LayerNorm when configured with the previous (post-LayerNorm) behavior. However, callers of the `TransformerEncoder` and `TransformerDecoder` classes have always been able to avoid this redundancy.
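For illustration, a minimal sketch of how the option could be exercised, assuming the flag is exposed as `norm_first` on the layer constructors (the name and signature here are illustrative, not a confirmed part of this PR):

```python
import torch
import torch.nn as nn

# Pre-LayerNorm ("norm-first") encoder layer; norm_first=False would keep the
# original post-LayerNorm behavior.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, norm_first=True)

# With norm-first layers, the output of the last layer is not normalized, so a
# final LayerNorm on the stack is useful. With the default post-LayerNorm
# layers that final norm is the redundancy discussed above, and callers of
# TransformerEncoder can simply omit the `norm` argument to avoid it.
encoder = nn.TransformerEncoder(layer, num_layers=6, norm=nn.LayerNorm(512))

src = torch.rand(10, 32, 512)  # (seq_len, batch, d_model)
out = encoder(src)             # same shape as src
```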
Reviewer notes:
1. I ran across this during other work and don't know whether anybody is already working on it (the most recent conversation in the issues seems to be from early April). Happy to abandon this if so.
2. I was looking for a quick way to add tests, but it looks like the existing ones in test_nn just compare against snapshots. I could add something similar, but I'm curious whether there's any prepackaged way to test that LayerNorm-first (the new option) yields a model that trains properly, etc. (a rough sketch of such a smoke test follows this list).
3. The new code in the `forward` methods was written to minimize diff churn rather than to maximize beauty :P Happy to pretty it up if desired.
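Regarding note 2, a rough smoke test might look like the following (hypothetical, not part of the PR; again assumes the flag is named `norm_first`): it checks that a norm-first layer runs forward/backward and that a short training loop reduces the loss on a toy regression task.

```python
import torch
import torch.nn as nn

def test_norm_first_layer_trains():
    torch.manual_seed(0)
    layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, norm_first=True)
    optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

    src = torch.rand(4, 8, 16)     # (seq_len, batch, d_model)
    target = torch.rand(4, 8, 16)  # toy regression target

    first_loss = None
    for _ in range(20):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(layer(src), target)
        loss.backward()
        optimizer.step()
        if first_loss is None:
            first_loss = loss.item()

    # Loss should decrease on this toy task if the layer trains at all.
    assert loss.item() < first_loss
```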
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D29356590
Pulled By: bhosmer
fbshipit-source-id: 308669326990b8923aab5fcd96e03b582fb21f24