Split is_synchronized_device API into multiple APIs (#5026)
DeepSpeed currently calls is_synchronized_device() to decide how to use the device.
HPU does not fit this definition since it behaves as if all streams were blocking streams:
they preserve order between each other but are asynchronous with respect to the CPU
(see cudaStreamCreateWithFlags).
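
Below is a minimal sketch, not DeepSpeed's actual accelerator interface, of how the single is_synchronized_device() check can be split into the finer-grained predicates introduced here, with illustrative answers for an HPU-like backend:

```python
# Hypothetical capability class; method names follow this PR description,
# return values illustrate the HPU behavior described below.
class HPULikeCapabilities:
    def is_synchronized_device(self) -> bool:
        return False  # execution is asynchronous with respect to the CPU

    def has_data_dependency_resolving(self) -> bool:
        return True   # ops execute in script order across streams

    def use_host_timers(self) -> bool:
        return False  # async execution -> time with device timers

    def has_memory_backpressure(self) -> bool:
        return True   # device enqueue blocks when memory is full
```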
**has_data_dependency_resolving()**
The HPU device is considered synchronized with respect to the CPU: operations execute
in script order regardless of the stream they were enqueued on, and tensor data is
guaranteed to be valid. There is no need for stream dependencies or CPU synchronizations.
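
As a hedged illustration (assuming the accelerator returned by DeepSpeed's get_accelerator() exposes the predicate named above; the helper and event argument are hypothetical), a caller could skip explicit stream dependencies like this:

```python
from deepspeed.accelerator import get_accelerator


def wait_for_fetched_param(fetch_event):
    accel = get_accelerator()
    if accel.has_data_dependency_resolving():
        # Ops already execute in script order, so the fetched tensor is
        # valid without any explicit stream dependency or CPU sync.
        return
    # Otherwise make the current stream wait on the prefetch stream's event.
    accel.current_stream().wait_event(fetch_event)
```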
**use_host_timers()**
HPU device execution is asynchronous. To measure device execution time, we must use
device timers.
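
A sketch of how timer selection might key off this predicate; the timer classes are placeholders rather than DeepSpeed's actual timing utilities, and the device timer assumes a torch-style event API on the accelerator object:

```python
import time


class HostTimer:
    """Wall-clock timer; accurate only when device work is synchronous."""
    def start(self):
        self._t0 = time.time()

    def stop(self):
        return time.time() - self._t0


class DeviceTimer:
    """Device-event based timer for asynchronous execution."""
    def __init__(self, accel):
        self._start = accel.Event(enable_timing=True)
        self._end = accel.Event(enable_timing=True)

    def start(self):
        self._start.record()

    def stop(self):
        self._end.record()
        self._end.synchronize()
        return self._start.elapsed_time(self._end) / 1000.0  # ms -> s


def make_timer(accel):
    # Async devices such as HPU report use_host_timers() == False and are
    # timed with device events instead of host wall-clock time.
    return HostTimer() if accel.use_host_timers() else DeviceTimer(accel)
```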
**has_memory_backpressure()**
Limiting the number of inflight fetched params and the number of inflight gradient
reduce_scatter calls is not necessary, since the HPU stops enqueuing calls when memory
is full, creating internal backpressure on the CPU until memory becomes available.
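
For example (illustrative helper and limit only, not DeepSpeed's actual ZeRO-3 code), the host-side cap on inflight work could simply be relaxed when the device provides its own backpressure:

```python
import sys


def max_inflight_ops(accel, default_limit=16):
    if accel.has_memory_backpressure():
        # The device stops enqueuing when memory is full, so the host does
        # not need to cap inflight param fetches or reduce_scatter calls.
        return sys.maxsize
    return default_limit
```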
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>