DeepSpeed
784cc26e - Fix Evoformer's multi-arch dispatch root cause (#7881)

Commit

30 days ago

Fix Evoformer's multi-arch dispatch root cause (#7881)

Fixes #7863
Replaces #7872

@Flamefire

Issue #7863 reports order-dependent failures in Evoformer when building for mixed CUDA architectures. The guard-only approach prevents some bad outputs but does not solve multi-generation packaging requirements.

This PR takes the root-cause direction: produce a correct multi-arch binary that can run on pre-Ampere and Ampere+ and select the right kernel family at runtime.

With TORCH_CUDA_ARCH_LIST='7.0;8.0':

1. Build is no longer pinned by -DGPU_ARCH; it uses runtime arch dispatch (evoformer_attn.py:33, gemm_kernel_utils.h:53).
2. Runtime chooses implementation by device CC:
   - CC >= 80 -> Sm80 (Ampere+ path)
   - CC >= 75 -> Sm75
   - CC >= 70 -> Sm70
3. So pre-Ampere uses pre-Ampere kernels, and Ampere+ uses the Ampere-family kernel path.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
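The dispatch rule described in the commit message can be illustrated with a small Python sketch. This is not the code from evoformer_attn.py or gemm_kernel_utils.h; the names `compute_capability_code` and `select_kernel_family` are hypothetical, and raising an error below CC 70 is an assumption, since the message does not say what happens on older devices. The only part taken from the message is the threshold order: check the highest compute capability first and fall through to the lower families.

```python
# Illustrative sketch only: hypothetical helpers, not DeepSpeed's actual code.

# Thresholds from the commit message, ordered highest first so that the
# first match is the newest kernel family the device supports.
_KERNEL_FAMILIES = [
    (80, "Sm80"),  # Ampere+ path
    (75, "Sm75"),
    (70, "Sm70"),
]


def compute_capability_code(major: int, minor: int) -> int:
    """Fold a (major, minor) compute capability into one integer, e.g. (7, 5) -> 75."""
    return major * 10 + minor


def select_kernel_family(major: int, minor: int) -> str:
    """Return the kernel family name for a device's compute capability."""
    cc = compute_capability_code(major, minor)
    for threshold, family in _KERNEL_FAMILIES:
        if cc >= threshold:
            return family
    # Assumption: devices below CC 70 are treated as unsupported here.
    raise RuntimeError(f"Unsupported compute capability: {major}.{minor}")


if __name__ == "__main__":
    for capability in [(7, 0), (7, 5), (8, 0), (9, 0)]:
        print(capability, "->", select_kernel_family(*capability))
```

In a PyTorch environment, the `(major, minor)` pair could be obtained from `torch.cuda.get_device_capability()`. The point of choosing at runtime is that a single binary built with TORCH_CUDA_ARCH_LIST='7.0;8.0' picks the family from the device it is actually running on, instead of from a value fixed at build time.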

References

#7881 - Fix Evoformer's multi-arch dispatch root cause

Author

tohtana

Parents

f88d0f8d
