Split is_synchronized_device API into multiple APIs (#5026)
DeepSpeed currently calls is_synchronized_device() to decide how to use the device.
HPU does not fit this definition since it behaves as if all streams were blocking streams:
they preserve order between each other but are asynchronous with respect to the CPU
(see cudaStreamCreateWithFlags).
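
Below is a minimal sketch, not DeepSpeed's actual accelerator interface, of how the single is_synchronized_device() check can be split into the finer-grained predicates introduced here, with illustrative answers for an HPU-like backend:

```python
# Hypothetical capability class; method names follow this PR description,
# return values illustrate the HPU behavior described below.
class HPULikeCapabilities:
    def is_synchronized_device(self) -> bool:
        return False  # execution is asynchronous with respect to the CPU

    def has_data_dependency_resolving(self) -> bool:
        return True   # ops execute in script order across streams

    def use_host_timers(self) -> bool:
        return False  # async execution -> time with device timers

    def has_memory_backpressure(self) -> bool:
        return True   # device enqueue blocks when memory is full
```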
**has_data_dependency_resolving()**
The HPU device is considered synchronized with respect to the CPU: operations execute
in script order regardless of the stream they were enqueued on, and tensor data is
guaranteed to be valid. There is no need for stream dependencies or CPU synchronizations.
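
As a hedged illustration (assuming the accelerator returned by DeepSpeed's get_accelerator() exposes the predicate named above; the helper and event argument are hypothetical), a caller could skip explicit stream dependencies like this:

```python
from deepspeed.accelerator import get_accelerator


def wait_for_fetched_param(fetch_event):
    accel = get_accelerator()
    if accel.has_data_dependency_resolving():
        # Ops already execute in script order, so the fetched tensor is
        # valid without any explicit stream dependency or CPU sync.
        return
    # Otherwise make the current stream wait on the prefetch stream's event.
    accel.current_stream().wait_event(fetch_event)
```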
**use_host_timers()**
HPU device execution is asynchronous. To measure device execution time, we must use
device timers.
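
A sketch of how timer selection might key off this predicate; the timer classes are placeholders rather than DeepSpeed's actual timing utilities, and the device timer assumes a torch-style event API on the accelerator object:

```python
import time


class HostTimer:
    """Wall-clock timer; accurate only when device work is synchronous."""
    def start(self):
        self._t0 = time.time()

    def stop(self):
        return time.time() - self._t0


class DeviceTimer:
    """Device-event based timer for asynchronous execution."""
    def __init__(self, accel):
        self._start = accel.Event(enable_timing=True)
        self._end = accel.Event(enable_timing=True)

    def start(self):
        self._start.record()

    def stop(self):
        self._end.record()
        self._end.synchronize()
        return self._start.elapsed_time(self._end) / 1000.0  # ms -> s


def make_timer(accel):
    # Async devices such as HPU report use_host_timers() == False and are
    # timed with device events instead of host wall-clock time.
    return HostTimer() if accel.use_host_timers() else DeviceTimer(accel)
```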
**has_memory_backpressure()**
Limiting the number of inflight fetched params and the number of inflight gradient
reduce_scatter calls is not necessary, since the HPU stops enqueuing calls when memory
is full, creating internal backpressure on the CPU until memory becomes available.
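
For example (illustrative helper and limit only, not DeepSpeed's actual ZeRO-3 code), the host-side cap on inflight work could simply be relaxed when the device provides its own backpressure:

```python
import sys


def max_inflight_ops(accel, default_limit=16):
    if accel.has_memory_backpressure():
        # The device stops enqueuing when memory is full, so the host does
        # not need to cap inflight param fetches or reduce_scatter calls.
        return sys.maxsize
    return default_limit
```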
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>