DeepSpeed
0fc19b6a - Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() (#5464)

Commit

1 year ago

Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device() (#5464)

Creating a Torch tensor with the parameter `device=get_accelerator().current_device()` can result in a crash when using an NPU.

This issue arises because the `current_device` API is expected to return the device id as an integer on all accelerators, according to the [interface docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103). However, specifying `device` as an integer when creating tensors directs Torch to use the CUDA backend by default, which leads to a crash on NPUs (and potentially on other accelerators as well).

To resolve this, we should use `get_accelerator().current_device_name()` instead, which returns the correct device identifier string, such as `"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the backend context needed to create tensors on a specific hardware accelerator.

I also notice that `device=get_accelerator().current_device()` is used in several files under deepspeed/inference, where it may also lead to crashes on other accelerators.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
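The difference between the two calls can be illustrated with a minimal, dependency-free sketch. `FakeNpuAccelerator` and `resolve_device` are hypothetical stand-ins written for this illustration, not DeepSpeed or Torch APIs. The sketch assumes, as the commit message describes, that a bare integer device is mapped to the CUDA backend.

```python
# Hypothetical stand-ins for illustration only; not the real DeepSpeed or Torch APIs.
class FakeNpuAccelerator:
    """Mimics the two accelerator methods discussed in the commit message."""

    def __init__(self, index=0):
        self._index = index

    def current_device(self):
        # Per the interface docs, this returns the device id as an integer.
        return self._index

    def current_device_name(self):
        # Returns a backend-qualified string such as "npu:0".
        return f"npu:{self._index}"


def resolve_device(device, default_backend="cuda"):
    """Simplified model of the described behavior (an assumption, not Torch's code):
    a bare integer carries no backend information, so the default backend is used."""
    if isinstance(device, int):
        return f"{default_backend}:{device}"
    return device


acc = FakeNpuAccelerator(index=0)
wrong = resolve_device(acc.current_device())       # integer falls back to the default backend
right = resolve_device(acc.current_device_name())  # string keeps the NPU backend
print(wrong, right)
```

In real code, the fix amounts to passing `device=get_accelerator().current_device_name()` wherever a tensor is created, so the backend is never inferred from an integer.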

References

#5464 - Fix crash when creating Torch tensor on NPU with device=get_accelerator().current_device()

Author

harygo2

Parents

90793aab
