xla
197f7b2d - Consider newest heartbeat to compare timeout (#1728)

Commit
5 years ago
Consider newest heartbeat to compare timeout (#1728)

Fixes https://github.com/pytorch/xla/issues/1690 taking care of case where:

```
t_0 = 0                               last mark_step in epoch 1
t_1 = t_0 + 1                         start checkpointing (after epoch 1)
t_2 = t_1                             rendezvous call blocking on others
t_3 = t_2 + uneven_heartbeat_timeout  checkpointing done + rendezvous returns
t_4 = t_3 + 1                         mark_step only by worker 0 (w_0)
t_5 = t_4 + 1                         check heartbeat process starts
  -> errors with unhealthy since for all w_i, s.t. i != 0,
     w_i.last_heartbeat = t_0 and
     t_4 - t_0 = uneven_heartbeat_timeout + 2 > uneven_heartbeat_timeout

So at t_3 we reset w_i.last_heartbeat = now for all i.
```
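The timeline above can be replayed with a small simulation. This is only an illustrative sketch, not the code changed by this commit: `stale_workers`, `simulate`, the `reset_on_rendezvous` flag, and the timeout value of 10 are all hypothetical names and numbers chosen for the example. It assumes a worker counts as unhealthy when its last heartbeat lags the newest heartbeat by more than the timeout.

```python
# Hypothetical replay of the timeline in the commit message.
# Names and the timeout value are illustrative assumptions only.
UNEVEN_HEARTBEAT_TIMEOUT = 10


def stale_workers(last_heartbeat, timeout):
    """Return ids of workers lagging the newest heartbeat by more than timeout."""
    newest = max(last_heartbeat.values())
    return sorted(w for w, t in last_heartbeat.items() if newest - t > timeout)


def simulate(reset_on_rendezvous, num_workers=4, timeout=UNEVEN_HEARTBEAT_TIMEOUT):
    """Walk through t_0..t_5 and return the workers flagged at the health check."""
    t0 = 0
    last = {i: t0 for i in range(num_workers)}  # last mark_step in epoch 1
    t1 = t0 + 1                                 # start checkpointing
    t2 = t1                                     # rendezvous blocks on others
    t3 = t2 + timeout                           # checkpoint done, rendezvous returns
    if reset_on_rendezvous:
        # The behaviour described in the message: reset every heartbeat at t_3.
        last = {i: t3 for i in last}
    t4 = t3 + 1
    last[0] = t4                                # mark_step only by worker 0
    # t_5: the heartbeat check runs on the state recorded so far.
    return stale_workers(last, timeout)


if __name__ == "__main__":
    print("without reset:", simulate(reset_on_rendezvous=False))
    print("with reset:   ", simulate(reset_on_rendezvous=True))
```

Without the reset, every worker other than worker 0 is flagged, because `t_4 - t_0` equals the timeout plus 2. With the reset at `t_3`, the gap between the newest heartbeat and any other worker is 1, so no worker is flagged.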