Add feature to increase the number of host to device transfer threads (#4693)
* Add feature to increase the number of host to device transfer threads
* Revert test set batch_size to 64
* Rename the config to OPTIMIZED_KWARGS_v4
* Change description to v4 instead of just v4-8, as this config improves ResNet performance on v4 slices and pods as well
* Remove extra line
* Add flag to switch v4 optimized config
* Generalize the config name so it can also cover a v5 config
* Add flexibility to define multiple configs based on TPU versions
* Keep the command consistent