Switch to native functional collectives by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as the process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. This change will not affect end-users, but it will impact `torch-xla`, which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This PR does not include the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.
Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so the old implementation stays covered until it is removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).
Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.
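
A minimal sketch of opting out via the environment variable. This message does not specify which values are accepted or when the variable is read, so the sketch assumes that any non-empty value disables native funcol and that setting it before importing `torch` is the safe choice. The helper `native_funcol_disabled` is hypothetical and only illustrates the check; it is not the function PyTorch uses.

```python
import os

ENV_VAR = "TORCH_DISABLE_NATIVE_FUNCOL"


def native_funcol_disabled(environ=None) -> bool:
    """Hypothetical helper: report whether the fallback was requested.

    Assumption: any non-empty value counts as "disabled". The accepted
    values are not documented in this commit message.
    """
    env = os.environ if environ is None else environ
    return bool(env.get(ENV_VAR))


# Request the old py funcol implementation for this process.
# Assumption: do this before importing torch, since it is not stated
# at which point the variable is read.
os.environ[ENV_VAR] = "1"

print("fallback requested:", native_funcol_disabled())
```

The same effect can be had from the shell by exporting `TORCH_DISABLE_NATIVE_FUNCOL=1` before launching the training script, under the same assumption about accepted values.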
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225