xla
36b3ca5b - Revise the look of test_train* scripts' stdout. (#1825)

Commit
5 years ago
Revise the look of test_train* scripts' stdout. (#1825) * Revise the look of test_train* scripts stdout. While verifying performance of resnet50, I had trouble calculating epoch times from the logs easily. I thought if we had lines to grep that easily gives information, it would be useful. * Changed the training and test updates. * Added training begin, end, validation end statements per epoch. * Added rendezvous so prints look sort of in sync. (still some scrambling happening, master print comes later than non-master etc.) * Now we can grep and easily understand: * epoch begin/end * split epoch time into training/validation * which accuracy belongs to which epoch * how many steps to go during training until the end of epoch * Below is how stdout of MP-imgnet looks: ``` [taylanbil]% cat mp-imgnet.txt ==> Preparing data.. | Training Device=xla:0/6 Epoch=1 Step=0/11 Loss=6.89059 Rate=141.80 GlobalRate=141.80 Time=00:03:00 ==> Preparing data.. | Training Device=xla:0/4 Epoch=1 Step=0/11 Loss=6.89059 Rate=84.10 GlobalRate=84.09 Time=00:03:00 ==> Preparing data.. Epoch 1 training begin 00:02:56 | Training Device=xla:1/0 Epoch=1 Step=0/11 Loss=6.89059 Rate=32.79 GlobalRate=32.79 Time=00:03:00 ==> Preparing data.. | Training Device=xla:0/3 Epoch=1 Step=0/11 Loss=6.89059 Rate=180.53 GlobalRate=180.53 Time=00:03:00 ==> Preparing data.. | Training Device=xla:0/7 Epoch=1 Step=0/11 Loss=6.89059 Rate=117.95 GlobalRate=117.95 Time=00:03:00 ==> Preparing data.. | Training Device=xla:0/2 Epoch=1 Step=0/11 Loss=6.89059 Rate=122.05 GlobalRate=122.05 Time=00:03:00 ==> Preparing data.. | Training Device=xla:0/1 Epoch=1 Step=0/11 Loss=6.89059 Rate=56.54 GlobalRate=56.54 Time=00:03:00 ==> Preparing data.. | Training Device=xla:0/5 Epoch=1 Step=0/11 Loss=6.89059 Rate=80.66 GlobalRate=80.66 Time=00:03:00 | Training Device=xla:1/0 Epoch=1 Step=2/11 Loss=6.50214 Rate=243.80 GlobalRate=84.03 Time=00:03:01 | Training Device=xla:0/3 Epoch=1 Step=2/11 Loss=6.50214 Rate=302.39 GlobalRate=279.00 Time=00:03:01 | Training Device=xla:0/2 Epoch=1 Step=2/11 Loss=6.50214 Rate=279.05 GlobalRate=223.79 Time=00:03:01 | Training Device=xla:0/5 Epoch=1 Step=2/11 Loss=6.50214 Rate=265.45 GlobalRate=171.00 Time=00:03:01 | Training Device=xla:0/4 Epoch=1 Step=2/11 Loss=6.50214 Rate=263.41 GlobalRate=175.30 Time=00:03:01 .... (removed) | Training Device=xla:0/2 Epoch=1 Step=10/11 Loss=0.00128 Rate=512.78 GlobalRate=377.78 Time=00:03:03 | Training Device=xla:0/4 Epoch=1 Step=10/11 Loss=0.00128 Rate=512.13 GlobalRate=335.02 Time=00:03:03 | Training Device=xla:0/1 Epoch=1 Step=10/11 Loss=0.00128 Rate=510.39 GlobalRate=285.26 Time=00:03:03 | Test Device=xla:0/3 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/6 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/5 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/2 Step=0/4 Epoch=1 Time=00:03:03 Epoch 1 training end 00:03:03 | Test Device=xla:1/0 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/1 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/4 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/3 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/6 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/5 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:1/0 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/2 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/4 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/1 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/7 Step=0/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/3 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:0/6 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:0/5 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:0/2 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:1/0 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:0/4 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:0/1 Epoch=1 Accuracy=100.00 Time=00:03:03 | Test Device=xla:0/7 Step=2/4 Epoch=1 Time=00:03:03 | Test Device=xla:0/7 Epoch=1 Accuracy=100.00 Time=00:03:03 | Training Device=xla:0/5 Epoch=2 Step=0/11 Loss=0.00020 Rate=160.09 GlobalRate=160.09 Time=00:03:04 | Training Device=xla:0/1 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.97 GlobalRate=159.96 Time=00:03:04 | Training Device=xla:0/6 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.90 GlobalRate=159.90 Time=00:03:04 | Training Device=xla:0/4 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.99 GlobalRate=159.99 Time=00:03:04 Epoch 1 validation end 00:03:03 Epoch 2 training begin 00:03:03 | Training Device=xla:1/0 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.83 GlobalRate=159.83 Time=00:03:04 | Training Device=xla:0/2 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.79 GlobalRate=159.79 Time=00:03:04 | Training Device=xla:0/3 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.80 GlobalRate=159.79 Time=00:03:04 | Training Device=xla:0/7 Epoch=2 Step=0/11 Loss=0.00020 Rate=159.28 GlobalRate=159.28 Time=00:03:04 | Training Device=xla:0/5 Epoch=2 Step=2/11 Loss=0.00001 Rate=380.40 GlobalRate=298.81 Time=00:03:05 | Training Device=xla:0/3 Epoch=2 Step=2/11 Loss=0.00001 Rate=381.09 GlobalRate=298.76 Time=00:03:05 | Training Device=xla:0/7 Epoch=2 Step=2/11 Loss=0.00001 Rate=381.71 GlobalRate=298.45 Time=00:03:05 .... (removed) | Training Device=xla:0/3 Epoch=2 Step=10/11 Loss=0.00000 Rate=514.62 GlobalRate=431.63 Time=00:03:07 | Training Device=xla:0/1 Epoch=2 Step=10/11 Loss=0.00000 Rate=513.43 GlobalRate=431.64 Time=00:03:07 | Training Device=xla:0/6 Epoch=2 Step=10/11 Loss=0.00000 Rate=510.92 GlobalRate=431.16 Time=00:03:07 | Test Device=xla:0/7 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/3 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/6 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/1 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/4 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/5 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/2 Step=0/4 Epoch=2 Time=00:03:07 Epoch 2 training end 00:03:07 | Test Device=xla:1/0 Step=0/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/7 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/3 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/6 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/1 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/4 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/5 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/2 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:1/0 Step=2/4 Epoch=2 Time=00:03:07 | Test Device=xla:0/7 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:0/6 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:0/3 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:0/4 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:0/1 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:0/5 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:0/2 Epoch=2 Accuracy=100.00 Time=00:03:07 | Test Device=xla:1/0 Epoch=2 Accuracy=100.00 Time=00:03:07 Epoch 2 validation end 00:03:07 Max Accuracy: 100.00% ``` [Here](https://drive.google.com/a/google.com/file/d/1CaJrpiHvXK3s09rQbA_Uimfm5jb-CMOi/view?usp=sharing) are sample outputs from out MT/MP-mnist/imgnet and MT-cifar scripts. * Address some comments. * Applied pyformat * Print all non-None items even if they bool to False. * Print all non-None items even if they bool to False - fixed.
Author
Parents
Loading