[TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using a manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicate TP API to easily shard those layers,
instead of user manually using DTensors.
I call this SequenceParallel, which might bring some confusion that we
technically "deprecated" a SequenceParallel style months ago. But this
time the SeuqenceParallel style is significantly different with the
previous ones (which used to shard two consecutive Linear layers). I
believe making it the right name is the first priority, instead of
worrying about the issue of reusing the old name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294