Fix timm_nfnet train batch size and improve overall model quality (#585)
Summary:
# Train
batch size: 128
Source:
- https://github.com/rwightman/pytorch-image-models/blob/master/train.py#L114
- https://gist.github.com/rwightman/bb59f9e245162cee0e38bd66bd8cd77f#file-bench_by_train-csv-L147
## Latency
GPU Time: 2767.694 milliseconds
CPU Dispatch Time: 2767.542 milliseconds
CPU Total Wall Time: 2767.665 milliseconds
## Profile

# Eval
batch size: 256
Source:
- https://github.com/rwightman/pytorch-image-models/blob/master/results/model_benchmark_amp_nchw_rtx3090.csv
## Latency
GPU Time: 2209.692 milliseconds
CPU Dispatch Time: 15.806 milliseconds
CPU Total Wall Time: 2209.685 milliseconds
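
The train and eval latency figures in this summary can be turned into rough throughput numbers. The sketch below is illustrative only: it assumes each reported latency covers exactly one batch at the listed batch size, which this summary does not state explicitly. The helper names (`throughput`, `dispatch_fraction`) and the `RUNS` table are hypothetical and not part of the benchmark code.

```python
# Latency figures copied from this summary, in milliseconds.
RUNS = {
    "train": {"batch_size": 128, "gpu_ms": 2767.694,
              "cpu_dispatch_ms": 2767.542, "cpu_wall_ms": 2767.665},
    "eval": {"batch_size": 256, "gpu_ms": 2209.692,
             "cpu_dispatch_ms": 15.806, "cpu_wall_ms": 2209.685},
}


def throughput(batch_size, wall_ms):
    """Samples per second, ASSUMING wall_ms covers exactly one batch."""
    return batch_size / (wall_ms / 1000.0)


def dispatch_fraction(dispatch_ms, wall_ms):
    """Share of wall time the CPU spent dispatching work (0.0 to 1.0)."""
    return dispatch_ms / wall_ms


def summarize(runs=RUNS):
    """Return {mode: (samples_per_second, dispatch_fraction)} for each run."""
    return {
        mode: (
            throughput(r["batch_size"], r["cpu_wall_ms"]),
            dispatch_fraction(r["cpu_dispatch_ms"], r["cpu_wall_ms"]),
        )
        for mode, r in runs.items()
    }


if __name__ == "__main__":
    for mode, (sps, frac) in summarize().items():
        print(f"{mode}: {sps:.1f} samples/s, dispatch share {frac:.2%}")
```

Under that one-batch assumption, train works out to roughly 46 samples per second and eval to roughly 116. The dispatch share is close to 100% for train and under 1% for eval. One possible reading is that the train step blocks the CPU until the GPU finishes while eval launches work asynchronously, but that is an interpretation and not something measured here.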
## Profile

Pull Request resolved: https://github.com/pytorch/benchmark/pull/585
Reviewed By: aaronenyeshi
Differential Revision: D33023646
Pulled By: xuzhao9
fbshipit-source-id: a62a408659a2c88ba46a503b7cea8c7baba616c2