Fix Evoformer's multi-arch dispatch root cause (#7881)
Fixes #7863
Replaces #7872
@Flamefire
Issue #7863 reports order-dependent failures in Evoformer when building
for mixed CUDA architectures. The guard-only approach prevents some bad
outputs but does not yield a single package that works across GPU
generations.
This PR takes the root-cause direction: produce a correct multi-arch
binary that can run on pre-Ampere and Ampere+ and select the right
kernel family at runtime.
With TORCH_CUDA_ARCH_LIST='7.0;8.0':
1. Build is no longer pinned by -DGPU_ARCH; it uses runtime arch
   dispatch (evoformer_attn.py:33, gemm_kernel_utils.h:53).
2. Runtime chooses the implementation by device compute capability (CC,
   encoded as major * 10 + minor), taking the first match:
   - CC >= 80 -> Sm80 (Ampere+ path)
   - CC >= 75 -> Sm75
   - CC >= 70 -> Sm70
3. So pre-Ampere devices use pre-Ampere kernels, and Ampere+ devices use
   the Ampere-family kernel path.
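
The selection order above can be sketched in Python as follows. This is
only an illustration of the first-match rule, not the code in
gemm_kernel_utils.h: the names `compute_capability`,
`select_kernel_family`, and `_ARCH_THRESHOLDS` are hypothetical, and
raising an error for CC < 70 is an assumption, since this message does
not say how older devices are handled.

```python
# Illustrative sketch only; all names here are hypothetical.
# Thresholds are checked from newest to oldest so the first match wins.
_ARCH_THRESHOLDS = ((80, "Sm80"), (75, "Sm75"), (70, "Sm70"))


def compute_capability(major: int, minor: int) -> int:
    """Encode a (major, minor) compute capability as major * 10 + minor."""
    return major * 10 + minor


def select_kernel_family(cc: int) -> str:
    """Return the kernel family name for an encoded compute capability."""
    for threshold, family in _ARCH_THRESHOLDS:
        if cc >= threshold:
            return family
    # Assumption: devices below CC 70 are treated as unsupported.
    raise ValueError(f"unsupported compute capability: {cc}")
```

With PyTorch, the `(major, minor)` pair for the current device could be
obtained from `torch.cuda.get_device_capability()` and passed to
`compute_capability`. Under this sketch, a binary built with
TORCH_CUDA_ARCH_LIST='7.0;8.0' would pick Sm70 on a CC 7.0 device and
Sm80 on a CC 8.0 or newer device.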
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>