Sort and dedupe -gencode flags emitted by op_builder.builder (#8021)
## Summary
- Sort and dedupe `ccs` in `CUDAOpBuilder.compute_capability_args` so
the emitted `-gencode` flags are deterministic regardless of the order
in which architectures appear in `TORCH_CUDA_ARCH_LIST` or
`cross_compile_archs`.
- Matches PyTorch's own canonicalisation, which already sorts the
gencode sequence (noted in #7871 while investigating #7863).
- Also dedupes so repeated arches do not produce duplicate `-gencode`
entries.
## Why
Issue #7871 observed that PyTorch sorts `-gencode` flags but DeepSpeed
emits them in the order entries appear in `TORCH_CUDA_ARCH_LIST`. That
order dependence contributed to the regression discussed in #7863. The
non-JIT branch in `op_builder/builder.py` did not sort or dedupe before
iterating over `self.ccs()`, so a setting such as
`TORCH_CUDA_ARCH_LIST="8.0;7.5;8.0;7.0"` produced an out-of-order,
duplicated flag sequence. The JIT branch already sorts (line 669), so
this change brings the non-JIT branch in line with it.
## Changes
- `op_builder/builder.py`: after `filter_ccs`, sort and dedupe `ccs` by
numeric `(major, minor)` (stripping any `+PTX` suffix for comparison).
The downstream `+PTX` handling at the emission site is preserved.
- `tests/unit/ops/test_op_builder.py`: new
`test_non_jit_branch_sorts_and_dedupes_gencode_flags` covering the
unsorted + duplicated input case. The existing
`test_non_jit_branch_unchanged` continues to pass.
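
For illustration, here is a minimal standalone sketch of the sort-and-dedupe step described above. It is not the code in `op_builder/builder.py`: the helper names `sort_and_dedupe_ccs` and `gencode_flags` are hypothetical, the sketch works on plain `"major.minor[+PTX]"` strings rather than the builder's internal representation, and it assumes that `8.0` and `8.0+PTX` collapse into a single `8.0+PTX` entry, which the actual change may handle differently.

```python
def sort_and_dedupe_ccs(ccs):
    """Sort arch strings by numeric (major, minor) and drop duplicates.

    Hypothetical helper: the "+PTX" suffix is ignored for comparison and
    kept on the merged entry if any duplicate requested it (an assumption).
    """
    merged = {}  # (major, minor) -> True if any duplicate asked for +PTX
    for cc in ccs:
        base, _, suffix = cc.partition("+")
        major, minor = (int(part) for part in base.split("."))
        key = (major, minor)
        merged[key] = merged.get(key, False) or suffix == "PTX"
    # Integer keys make 9.0 sort before 10.0, unlike a plain string sort.
    return [
        f"{major}.{minor}" + ("+PTX" if ptx else "")
        for (major, minor), ptx in sorted(merged.items())
    ]


def gencode_flags(ccs):
    """Emit nvcc-style -gencode flags for the canonicalised arch list."""
    flags = []
    for cc in sort_and_dedupe_ccs(ccs):
        base, _, suffix = cc.partition("+")
        num = base.replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if suffix == "PTX":
            # PTX is emitted as an extra entry, after the SASS entry.
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags


if __name__ == "__main__":
    arch_list = "8.0;7.5;8.0;7.0"  # the unsorted, duplicated example input
    for flag in gencode_flags(arch_list.split(";")):
        print(flag)
```

With the example input, the sketch emits one flag each for `sm_70`, `sm_75`, and `sm_80`, in that order, whatever order the entries were listed in.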
## Test plan
- [x] `pytest tests/unit/ops/test_op_builder.py -x -v` (7 passed,
including the new test and the prior non-JIT regression test)
- [x] `yapf` (no diff)
- [x] `codespell` (clean)
Fixes #7871
---------
Signed-off-by: Aditya Singh <adisin650@gmail.com>