Add initial cpu userbenchmark for torchbench (#1559)
Summary:
Add initial cpu userbenchmark for torchbench
This works toward the roadmap in https://github.com/pytorch/benchmark/issues/1293 and extends the cpu userbenchmark with the following features:
- [x] Add a core binding option and support multi-instance tests.
- [x] Add gomp/iomp option.
- [x] Add memory allocator option.
- [x] Support testing all enabled cpu features on torchbench models, e.g. channels-last / fx_int8 / jit with fusers.
- [x] Support the latency and cpu_peak_mem metrics for now; an fps-like report will be added later.
- [x] Add `README.md`
For example, the command line below runs fx_int8 inference for 2 models with batch size 8 on CLX socket 0, with 4 instances running at the same time.
```shell
$ python run_benchmark.py cpu --model resnet50,alexnet --test eval -b 8 --precision fx_int8 --launcher --launcher-args "--node-id 0 --ninstances 4"
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'resnet50', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:37,960 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:37,960 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:37,960 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:37,960 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:37,960 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:37,960 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m resnet50 --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='resnet50', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done]
[Done]
[Done]
[Done]
Running benchmark: ['/localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python', '-m', 'torch.backends.xeon.run_cpu', '--node-id', '0', '--ninstances', '4', '/localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py', '-m', 'alexnet', '--device', 'cpu', '-b', '8', '-t', 'eval', '-o', PosixPath('/localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336')]
2023-04-20 00:43:53,444 - __main__ - INFO - Use JeMalloc memory allocator
2023-04-20 00:43:53,444 - __main__ - INFO - OMP_NUM_THREADS=7
2023-04-20 00:43:53,444 - __main__ - INFO - Using Intel OpenMP
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_AFFINITY=granularity=fine,compact,1,0
2023-04-20 00:43:53,444 - __main__ - INFO - KMP_BLOCKTIME=1
2023-04-20 00:43:53,444 - __main__ - INFO - LD_PRELOAD=/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libiomp5.so:/localdisk/chuanqiw/miniconda3/envs/torchdynamo/lib/libjemalloc.so
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 0-6 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 7-13 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 14-20 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
2023-04-20 00:43:53,445 - __main__ - INFO - numactl -C 21-27 -m 0 /localdisk/chuanqiw/miniconda3/envs/torchdynamo/bin/python -u /localdisk/chuanqiw/PT/benchmark/userbenchmark/cpu/run_config.py -m alexnet --device cpu -b 8 -t eval -o /localdisk/chuanqiw/PT/benchmark/.userbenchmark/cpu/cpu-20230420004336
Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ...Running TorchBenchModelConfig(name='alexnet', device='cpu', test='eval', batch_size=8, jit=False, extra_args=[], extra_env=None) ... [Done]
[Done]
[Done]
[Done]
```
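The core binding visible in the log above (four `numactl -C` ranges of 7 cores each, with `OMP_NUM_THREADS=7`) follows from splitting the cores of node 0 evenly across instances. The sketch below only illustrates that arithmetic; it is not the launcher's actual implementation. `split_cores` is a hypothetical helper, and the 28-core count is inferred from the ranges in the log.

```python
def split_cores(first_core, num_cores, ninstances):
    """Split a contiguous core range into equal, non-overlapping per-instance ranges.

    Hypothetical helper for illustration; returns (threads_per_instance, ranges),
    where each range is an inclusive (start, end) pair of core ids.
    """
    if ninstances <= 0:
        raise ValueError("ninstances must be positive")
    per_instance = num_cores // ninstances
    if per_instance == 0:
        raise ValueError("more instances than cores")
    ranges = []
    for i in range(ninstances):
        start = first_core + i * per_instance
        end = start + per_instance - 1  # inclusive upper bound, as numactl -C expects
        ranges.append((start, end))
    return per_instance, ranges


# Assumption: node 0 exposes cores 0-27, as the numactl ranges in the log suggest.
threads, ranges = split_cores(0, 28, 4)
for start, end in ranges:
    print(f"OMP_NUM_THREADS={threads} numactl -C {start}-{end} -m 0 <command>")
```

With these inputs the ranges are 0-6, 7-13, 14-20 and 21-27, matching the log.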
The test results are in `.userbenchmark/cpu/cpu-20230420004336`. The `cpu` userbenchmark creates a subfolder for each test and aggregates all test results into `.userbenchmark/cpu/metrics-20230420004336.json`. Each subfolder contains one metrics file per instance of that model test, named with the instance PID.
```shell
$ ls .userbenchmark/cpu/cpu-20230420004336
eval_alexnet_eager/ eval_resnet50_eager/
$ ls .userbenchmark/cpu/cpu-20230420004336/eval_alexnet_eager/
metrics-3347653.json metrics-3347654.json metrics-3347655.json metrics-3347656.json
$ cat .userbenchmark/cpu/metrics-20230420004336.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "de1114554c38322273c066c091d455519d45472d"
    },
    "metrics": {
        "alexnet-eval-eager_latency": 58.309660750000006,
        "alexnet-eval-eager_cmem": 0.416259765625,
        "resnet50-eval-eager_latency": 335.04970325,
        "resnet50-eval-eager_cmem": 0.90673828125
    }
}
```
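How the per-instance files are combined into the aggregated file is not described here. As a minimal sketch, assuming each `metrics-<PID>.json` holds a `metrics` dictionary shaped like the aggregated one, and assuming the aggregate is a plain mean across instances (both are assumptions, not confirmed by this change), the step could look like this. `aggregate_metrics` is a hypothetical name.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def aggregate_metrics(test_dir):
    """Average each metric across the per-instance JSON files in test_dir.

    Assumption: every file looks like {"metrics": {"<name>": <number>, ...}}.
    """
    values = defaultdict(list)
    for path in sorted(Path(test_dir).glob("metrics-*.json")):
        with open(path) as f:
            data = json.load(f)
        for name, value in data.get("metrics", {}).items():
            values[name].append(value)
    # Plain mean per metric; the real userbenchmark may aggregate differently.
    return {name: mean(vals) for name, vals in values.items()}
```

For example, `aggregate_metrics(".userbenchmark/cpu/cpu-20230420004336/eval_alexnet_eager")` would return one averaged value per metric name found in the four instance files.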
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1559
Reviewed By: aaronenyeshi
Differential Revision: D45450175
Pulled By: xuzhao9
fbshipit-source-id: 8e7528f4d694eae182ee601cd80bc6e57cd14e3c