benchmark
9b9668fa - Fix DLRM batch size argument to be mini_batch_size (#1082)

Commit

3 years ago

Fix DLRM batch size argument to be mini_batch_size (#1082)

Summary:
Using num_batches was not changing the batch size properly from a high level script. Mapping batch_size to mini_batch_size now changes the actual batch size used by the DLRM architecture.

With the default batch size (2048), the run took 9 ms:
```
$ python run.py dlrm -d cuda -m eager --profile -t train
Running train method from dlrm on cuda in eager mode with input batch size 2048.
Collecting CUDA activity.
STAGE:2022-08-03 17:03:39 92135:92135 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:03:39 92135:92135 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-08-03 17:03:40 92135:92135 output_json.cpp:417] Completed Stage: Post Processing
STAGE:2022-08-03 17:03:40 92135:92135 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:03:40 92135:92135 ActivityProfilerController.cpp:300] Completed Stage: Collection
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    cudaDeviceSynchronize       100.00%       7.000us       100.00%       7.000us       7.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 7.000us

Saved TensorBoard Profiler traces to ./logs.
```
![image](https://user-images.githubusercontent.com/17602366/182669173-65bccdda-238f-441f-acfc-f3a8a03079ae.png)

With the batch size set to 4096, the run took 16 ms:
```
$ python run.py dlrm -d cuda -m eager --profile -t train --bs 4096
Running train method from dlrm on cuda in eager mode with input batch size 4096.
Collecting CUDA activity.
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:300] Completed Stage: Collection
STAGE:2022-08-03 17:05:42 92704:92704 output_json.cpp:417] Completed Stage: Post Processing
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:294] Completed Stage: Warm Up
STAGE:2022-08-03 17:05:42 92704:92704 ActivityProfilerController.cpp:300] Completed Stage: Collection
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
    cudaDeviceSynchronize       100.00%       8.000us       100.00%       8.000us       8.000us             1
-------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.000us

Saved TensorBoard Profiler traces to ./logs.
```
![image](https://user-images.githubusercontent.com/17602366/182668585-768f2a75-1312-4407-853e-f0d102a39ab1.png)

Pull Request resolved: https://github.com/pytorch/benchmark/pull/1082

Reviewed By: xuzhao9

Differential Revision: D38547946

Pulled By: aaronenyeshi

fbshipit-source-id: 8da0d12906510887b2515fd22766474dbda06f45
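
The distinction the summary draws between num_batches and mini_batch_size can be illustrated with a minimal sketch. This is not the code from the commit: `DLRMOptions`, `build_options`, and the `num_batches` default of 1 are hypothetical names and values chosen for illustration. Only the two option names and the 2048 default come from the message above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DLRMOptions:
    """Hypothetical stand-in for the DLRM option set; only two fields are modeled."""
    mini_batch_size: int = 2048  # samples per step (default quoted in the commit message)
    num_batches: int = 1         # number of batches to generate (assumed default)


def build_options(batch_size: Optional[int] = None) -> DLRMOptions:
    """Map a high-level batch size (e.g. from --bs) onto mini_batch_size."""
    opts = DLRMOptions()
    if batch_size is not None:
        # Set the size of each batch; the batch count is left untouched.
        opts.mini_batch_size = batch_size
    return opts


if __name__ == "__main__":
    print(build_options())      # default batch size
    print(build_options(4096))  # overridden batch size
```

Under these assumptions, passing a batch size changes how many samples each step processes, while num_batches only controls how many such batches exist.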

Author

aaronenyeshi

Committer

facebook-github-bot

Parents

bf84f487
