Fix DLRM batch size argument to be mini_batch_size (#1082)
Summary:
Using num_batches did not properly change the batch size when it was set from a high-level script.
The batch_size argument now sets mini_batch_size instead, which changes the actual batch size
used in the DLRM architecture.
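As a rough illustration of the intended behavior (not the actual benchmark code), the sketch below uses a hypothetical `DLRMOptions` container and a hypothetical `make_options` helper. Only the names `mini_batch_size` and `num_batches` and the default of 2048 come from this change; everything else is assumed for the example.
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DLRMOptions:
    """Hypothetical stand-in for the DLRM option namespace."""
    mini_batch_size: int = 2048  # samples per batch (default reported above)
    num_batches: int = 1         # number of batches; assumed default


def make_options(batch_size: Optional[int] = None) -> DLRMOptions:
    """Map a user-facing batch_size onto mini_batch_size, not num_batches."""
    opts = DLRMOptions()
    if batch_size is not None:
        # The per-batch sample count is what the model actually consumes.
        opts.mini_batch_size = batch_size
    return opts


default_opts = make_options()
large_opts = make_options(4096)
print(default_opts.mini_batch_size, large_opts.mini_batch_size)
```
In this sketch, passing a batch size leaves `num_batches` untouched and only changes the per-batch sample count.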
Here is a run with the default batch_size (2048), which took 9 ms:
```
$ python run.py dlrm -d cuda -m eager --profile -t train
Running train method from dlrm on cuda in eager mode with input batch size 2048.
Collecting CUDA activity.
STAGE:2022-08-03 17:03:39 92135:92135 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:03:39 92135:92135 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-08-03 17:03:40 92135:92135 output_json.cpp:417] Completed Stage: Post Processing
STAGE:2022-08-03 17:03:40 92135:92135 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:03:40 92135:92135 ActivityProfilerController.cpp:300] Completed Stage: Collection
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
cudaDeviceSynchronize 100.00% 7.000us 100.00% 7.000us 7.000us 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 7.000us
Saved TensorBoard Profiler traces to ./logs.
```

Here is a run with the batch size set to 4096, which took 16 ms:
```
$ python run.py dlrm -d cuda -m eager --profile -t train --bs 4096
Running train method from dlrm on cuda in eager mode with input batch size 4096.
Collecting CUDA activity.
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-08-03 17:05:42 92704:92704 output_json.cpp:417] Completed Stage: Post Processing
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:300] Completed Stage: Collection
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
cudaDeviceSynchronize 100.00% 8.000us 100.00% 8.000us 8.000us 1
------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 8.000us
Saved TensorBoard Profiler traces to ./logs.
```

Pull Request resolved: https://github.com/pytorch/benchmark/pull/1082
Reviewed By: xuzhao9
Differential Revision: D38547946
Pulled By: aaronenyeshi
fbshipit-source-id: 8da0d12906510887b2515fd22766474dbda06f45