Fix GPU utilization issue of resnext50_32x4d model (#550)
Summary:
# Eval
## Batch scaling analysis
Batch Size | GPU Time | CPU Dispatch Time | Walltime | GPU Delta
-- | -- | -- | -- | --
1 | 20.14 | 20.07 | 20.151 | -
2 | 21.968 | 21.886 | 21.98 | 0.09076464747
4 | 21.776 | 21.706 | 21.796 | -0.008739985433
8 | 39.197 | 24.668 | 39.202 | 0.8000091844
16 | 68.581 | 23.632 | 68.594 | 0.7496492078
32 | 135.644 | 26.935 | 135.663 | 0.9778655896
64 | 254.138 | 22.224 | 254.139 | 0.8735660995
Best batch size: 8
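
The GPU Delta column appears to be the relative change in GPU time from the previous batch size (for example, 21.968 / 20.14 - 1 ≈ 0.0908 for batch size 2). A minimal sketch that reproduces it from the eval numbers above; the helper name `gpu_deltas` is illustrative and not part of the benchmark code:

```python
# GPU times copied from the eval table, keyed by batch size.
gpu_time = {
    1: 20.14,
    2: 21.968,
    4: 21.776,
    8: 39.197,
    16: 68.581,
    32: 135.644,
    64: 254.138,
}


def gpu_deltas(times):
    """Relative change in GPU time between consecutive batch sizes.

    Assumes delta = time[current] / time[previous] - 1, which matches
    the values in the table; the first batch size has no delta.
    """
    sizes = sorted(times)
    return {cur: times[cur] / times[prev] - 1.0
            for prev, cur in zip(sizes, sizes[1:])}


deltas = gpu_deltas(gpu_time)
for bs, delta in deltas.items():
    print(f"bs={bs:>2}  gpu_delta={delta:.4f}")
```

A delta near 0 means doubling the batch size cost almost no extra GPU time; a delta near 1 means GPU time roughly doubled along with the batch size.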
## Non-idleness analysis

# Train
## Batch scaling analysis
Batch Size | GPU Time | CPU Dispatch Time | Walltime | GPU Delta
-- | -- | -- | -- | --
1 | 198.385 | 190.304 | 198.387 | -
2 | 249.487 | 242.669 | 249.49 | 0.2575900396
4 | 369.653 | 361.657 | 369.65 | 0.4816523506
8 | 597.16 | 589.141 | 597.152 | 0.6154609864
16 | 1227.223 | 1220.652 | 1227.2 | 1.055099136
32 | 2417.154 | 2410.771 | 2417.101 | 0.9696126947
64 | 4719.082 | 4711.292 | 4718.974 | 0.9523298888
Best batch size: 8
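
One way to read the train table is to compare CPU dispatch time with walltime at each batch size. The sketch below computes that ratio from the numbers above. The ratio is not a column in the original spreadsheet, and the name `dispatch_ratio` is hypothetical, used here only for illustration:

```python
# (CPU dispatch time, walltime) copied from the train table, keyed by batch size.
train = {
    1: (190.304, 198.387),
    2: (242.669, 249.49),
    4: (361.657, 369.65),
    8: (589.141, 597.152),
    16: (1220.652, 1227.2),
    32: (2410.771, 2417.101),
    64: (4711.292, 4718.974),
}


def dispatch_ratio(rows):
    """Fraction of walltime accounted for by CPU dispatch (illustrative metric)."""
    return {bs: cpu / wall for bs, (cpu, wall) in rows.items()}


ratios = dispatch_ratio(train)
for bs, ratio in ratios.items():
    print(f"bs={bs:>2}  cpu_dispatch/walltime={ratio:.3f}")
```

A ratio close to 1 means CPU dispatch takes nearly as long as the whole iteration at that batch size.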
## Non-idleness analysis

STABLE_TEST_MODEL: resnext50_32x4d
Pull Request resolved: https://github.com/pytorch/benchmark/pull/550
Reviewed By: aaronenyeshi
Differential Revision: D32286373
Pulled By: xuzhao9
fbshipit-source-id: 65bc2c94dd370070232b0cce180b271debe5e93d