[FSDP2] implement HSDP (#121569)
Support HSDP in per-parameter-sharding FSDP: https://github.com/pytorch/pytorch/issues/121023
HSDP is a hybrid of FSDP and DDP: gradients are reduce-scattered intra-node (FSDP) and all-reduced inter-node (DDP). A setup sketch follows.
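A minimal sketch of how a user would enable HSDP with this API, assuming the import path used at the time of this PR (``torch.distributed._composable.fsdp``); the 2 × 2 mesh shape, the toy model, and the ``replicate``/``shard`` dim names are illustrative, not taken from the PR:
```python
# Launch with: torchrun --nproc_per_node=4 this_script.py
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2-way replication (DDP-style all-reduce) x 2-way sharding (FSDP-style
# reduce-scatter): grads are reduce-scattered over the "shard" dim, then
# all-reduced over the "replicate" dim.
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("replicate", "shard"))

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16)).cuda()
for layer in model:
    fully_shard(layer, mesh=mesh)  # shard each layer's parameters over the mesh
fully_shard(model, mesh=mesh)      # root module handles any remaining parameters

loss = model(torch.randn(8, 16, device="cuda")).sum()
loss.backward()  # reduce-scatter intra-node, then all-reduce inter-node
```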
For the unit test, we run on a 2 (replicate) × 2 (shard) mesh of 4 GPUs on a single node: ``pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp``
In profiler traces, the all-reduce overlaps with the next reduce-scatter (a sketch for capturing such a trace follows the screenshot):
<img width="886" alt="Screenshot 2024-03-14 at 3 02 52 PM" src="https://github.com/pytorch/pytorch/assets/134637289/98f1f2b5-c99d-4744-9938-10d0431487e5">
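A hedged sketch for reproducing a trace like the one above; it assumes ``model`` is the HSDP-wrapped module from the earlier sketch, while ``torch.profiler`` itself is standard PyTorch API:
```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(3):  # a few steps so the steady-state overlap is visible
        model(torch.randn(8, 16, device="cuda")).sum().backward()

prof.export_chrome_trace("hsdp_trace.json")  # view in chrome://tracing or Perfetto
```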
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121569
Approved by: https://github.com/awgu