Revise the look of test_train* scripts' stdout. (#1825)
* Revise the look of test_train* scripts stdout.
While verifying performance of resnet50, I had trouble calculating epoch
times from the logs easily. I thought if we had lines to grep that
easily gives information, it would be useful.
* Changed the training and test updates.
* Added training begin, end, validation end statements per epoch.
* Added rendezvous so prints look sort of in sync. (still some
scrambling happening, master print comes later than non-master etc.)
* Now we can grep and easily understand:
* epoch begin/end
* split epoch time into training/validation
* which accuracy belongs to which epoch
* how many steps to go during training until the end of epoch
* Below is how stdout of MP-imgnet looks:
```
[taylanbil]% cat mp-imgnet.txt
==> Preparing data..
| Training Device=xla:0/6 Epoch=1 Step=0/11 Loss=6.89059 Rate=141.80
GlobalRate=141.80 Time=00:03:00
==> Preparing data..
| Training Device=xla:0/4 Epoch=1 Step=0/11 Loss=6.89059 Rate=84.10
GlobalRate=84.09 Time=00:03:00
==> Preparing data..
Epoch 1 training begin 00:02:56
| Training Device=xla:1/0 Epoch=1 Step=0/11 Loss=6.89059 Rate=32.79
GlobalRate=32.79 Time=00:03:00
==> Preparing data..
| Training Device=xla:0/3 Epoch=1 Step=0/11 Loss=6.89059 Rate=180.53
GlobalRate=180.53 Time=00:03:00
==> Preparing data..
| Training Device=xla:0/7 Epoch=1 Step=0/11 Loss=6.89059 Rate=117.95
GlobalRate=117.95 Time=00:03:00
==> Preparing data..
| Training Device=xla:0/2 Epoch=1 Step=0/11 Loss=6.89059 Rate=122.05
GlobalRate=122.05 Time=00:03:00
==> Preparing data..
| Training Device=xla:0/1 Epoch=1 Step=0/11 Loss=6.89059 Rate=56.54
GlobalRate=56.54 Time=00:03:00
==> Preparing data..
| Training Device=xla:0/5 Epoch=1 Step=0/11 Loss=6.89059 Rate=80.66
GlobalRate=80.66 Time=00:03:00
| Training Device=xla:1/0 Epoch=1 Step=2/11 Loss=6.50214 Rate=243.80
GlobalRate=84.03 Time=00:03:01
| Training Device=xla:0/3 Epoch=1 Step=2/11 Loss=6.50214 Rate=302.39
GlobalRate=279.00 Time=00:03:01
| Training Device=xla:0/2 Epoch=1 Step=2/11 Loss=6.50214 Rate=279.05
GlobalRate=223.79 Time=00:03:01
| Training Device=xla:0/5 Epoch=1 Step=2/11 Loss=6.50214 Rate=265.45
GlobalRate=171.00 Time=00:03:01
| Training Device=xla:0/4 Epoch=1 Step=2/11 Loss=6.50214 Rate=263.41
GlobalRate=175.30 Time=00:03:01
.... (removed)
| Training Device=xla:0/2 Epoch=1 Step=10/11 Loss=0.00128 Rate=512.78
GlobalRate=377.78 Time=00:03:03
| Training Device=xla:0/4 Epoch=1 Step=10/11 Loss=0.00128 Rate=512.13
GlobalRate=335.02 Time=00:03:03
| Training Device=xla:0/1 Epoch=1 Step=10/11 Loss=0.00128 Rate=510.39
GlobalRate=285.26 Time=00:03:03
| Test Device=xla:0/3 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/6 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/5 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/2 Step=0/4 Epoch=1 Time=00:03:03
Epoch 1 training end 00:03:03
| Test Device=xla:1/0 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/1 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/4 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/3 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/6 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/5 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:1/0 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/2 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/4 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/1 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/7 Step=0/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/3 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:0/6 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:0/5 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:0/2 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:1/0 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:0/4 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:0/1 Epoch=1 Accuracy=100.00 Time=00:03:03
| Test Device=xla:0/7 Step=2/4 Epoch=1 Time=00:03:03
| Test Device=xla:0/7 Epoch=1 Accuracy=100.00 Time=00:03:03
| Training Device=xla:0/5 Epoch=2 Step=0/11 Loss=0.00020 Rate=160.09
GlobalRate=160.09 Time=00:03:04
| Training Device=xla:0/1 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.97
GlobalRate=159.96 Time=00:03:04
| Training Device=xla:0/6 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.90
GlobalRate=159.90 Time=00:03:04
| Training Device=xla:0/4 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.99
GlobalRate=159.99 Time=00:03:04
Epoch 1 validation end 00:03:03
Epoch 2 training begin 00:03:03
| Training Device=xla:1/0 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.83
GlobalRate=159.83 Time=00:03:04
| Training Device=xla:0/2 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.79
GlobalRate=159.79 Time=00:03:04
| Training Device=xla:0/3 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.80
GlobalRate=159.79 Time=00:03:04
| Training Device=xla:0/7 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.28
GlobalRate=159.28 Time=00:03:04
| Training Device=xla:0/5 Epoch=2 Step=2/11 Loss=0.00001 Rate=380.40
GlobalRate=298.81 Time=00:03:05
| Training Device=xla:0/3 Epoch=2 Step=2/11 Loss=0.00001 Rate=381.09
GlobalRate=298.76 Time=00:03:05
| Training Device=xla:0/7 Epoch=2 Step=2/11 Loss=0.00001 Rate=381.71
GlobalRate=298.45 Time=00:03:05
.... (removed)
| Training Device=xla:0/3 Epoch=2 Step=10/11 Loss=0.00000 Rate=514.62
GlobalRate=431.63 Time=00:03:07
| Training Device=xla:0/1 Epoch=2 Step=10/11 Loss=0.00000 Rate=513.43
GlobalRate=431.64 Time=00:03:07
| Training Device=xla:0/6 Epoch=2 Step=10/11 Loss=0.00000 Rate=510.92
GlobalRate=431.16 Time=00:03:07
| Test Device=xla:0/7 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/3 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/6 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/1 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/4 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/5 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/2 Step=0/4 Epoch=2 Time=00:03:07
Epoch 2 training end 00:03:07
| Test Device=xla:1/0 Step=0/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/7 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/3 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/6 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/1 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/4 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/5 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/2 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:1/0 Step=2/4 Epoch=2 Time=00:03:07
| Test Device=xla:0/7 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:0/6 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:0/3 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:0/4 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:0/1 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:0/5 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:0/2 Epoch=2 Accuracy=100.00 Time=00:03:07
| Test Device=xla:1/0 Epoch=2 Accuracy=100.00 Time=00:03:07
Epoch 2 validation end 00:03:07
Max Accuracy: 100.00%
```
[Here](https://drive.google.com/a/google.com/file/d/1CaJrpiHvXK3s09rQbA_Uimfm5jb-CMOi/view?usp=sharing) are sample outputs from out MT/MP-mnist/imgnet and MT-cifar scripts.
* Address some comments.
* Applied pyformat
* Print all non-None items even if they bool to False.
* Print all non-None items even if they bool to False - fixed.