43d8a999 - Always obtain `model_flops` from the eager mode (#2390)

Summary: The flops counter does not support running in inductor mode, so we first obtain the model flops from eager mode, then use the latency of inductor mode to calculate the final flops.

Fixes https://github.com/pytorch/benchmark/issues/2383

Pull Request resolved: https://github.com/pytorch/benchmark/pull/2390

Test Plan:
```
python run.py resnet50 --metrics model_flops -d cuda -t train

Module                          FLOP       % Total
-----------------------------   ---------  ---------
ResNet                          1039.278B  100.00%
 - aten.convolution              523.153B   50.34%
 - aten.addmm                      0.262B    0.03%
 - aten.mm                         0.262B    0.03%
 - aten.convolution_backward     515.601B   49.61%
ResNet.conv1                      22.659B    2.18%
 - aten.convolution               15.106B    1.45%
 - aten.convolution_backward       7.553B    0.73%
ResNet.fc                          0.524B    0.05%
 - aten.addmm                      0.262B    0.03%
 - aten.mm                         0.262B    0.03%
ResNet.layer1                    170.993B   16.45%
 - aten.convolution               85.497B    8.23%
 - aten.convolution_backward      85.497B    8.23%
ResNet.layer2                    263.067B   25.31%
 - aten.convolution              131.533B   12.66%
 - aten.convolution_backward     131.533B   12.66%
ResNet.layer3                    374.870B   36.07%
 - aten.convolution              187.435B   18.04%
 - aten.convolution_backward     187.435B   18.04%
ResNet.layer4                    207.165B   19.93%
 - aten.convolution              103.583B    9.97%
 - aten.convolution_backward     103.583B    9.97%

Running train method from resnet50 on cuda in eager mode with input batch size 32 and precision fp32.
GPU Time per batch:       90.353 milliseconds
CPU Wall Time per batch:  90.369 milliseconds
CPU Wall Time:            90.369 milliseconds
Model Flops:              11.5004 TFLOPs per second
```

```
python run.py resnet50 --metrics model_flops -d cuda -t train --inductor

Module                          FLOP       % Total
-----------------------------   ---------  ---------
ResNet                          1039.278B  100.00%
 - aten.convolution              523.153B   50.34%
 - aten.addmm                      0.262B    0.03%
 - aten.mm                         0.262B    0.03%
 - aten.convolution_backward     515.601B   49.61%
ResNet.conv1                      22.659B    2.18%
 - aten.convolution               15.106B    1.45%
 - aten.convolution_backward       7.553B    0.73%
ResNet.fc                          0.524B    0.05%
 - aten.addmm                      0.262B    0.03%
 - aten.mm                         0.262B    0.03%
ResNet.layer1                    170.993B   16.45%
 - aten.convolution               85.497B    8.23%
 - aten.convolution_backward      85.497B    8.23%
ResNet.layer2                    263.067B   25.31%
 - aten.convolution              131.533B   12.66%
 - aten.convolution_backward     131.533B   12.66%
ResNet.layer3                    374.870B   36.07%
 - aten.convolution              187.435B   18.04%
 - aten.convolution_backward     187.435B   18.04%
ResNet.layer4                    207.165B   19.93%
 - aten.convolution              103.583B    9.97%
 - aten.convolution_backward     103.583B    9.97%

Running train method from resnet50 on cuda in dynamo inductor mode with input batch size 32 and precision fp32.
/home/xz/miniconda3/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:150: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
GPU Time per batch:       59.166 milliseconds
CPU Wall Time per batch:  59.211 milliseconds
CPU Wall Time:            59.211 milliseconds
Model Flops:              17.5521 TFLOPs per second
PT2 Compilation time:     18.833 seconds
```

Reviewed By: FindHao

Differential Revision: D60048189

Pulled By: xuzhao9

fbshipit-source-id: 1cf150fe4f2c5d95709cecbb30179c38246af781
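The fix separates the two measurements: the FLOP count is taken once in eager mode (where the flops counter can trace the ops), and throughput is then the eager-counted FLOPs divided by the measured per-batch latency of whichever backend is being timed. The reported values line up with dividing the total FLOP count (1039.278 GFLOP) by the CPU wall time per batch. A minimal sketch of that arithmetic, using the numbers from the test plan above (function and variable names are illustrative, not the benchmark's actual code):

```python
def achieved_tflops_per_sec(model_flops: float, latency_sec: float) -> float:
    """Throughput from FLOPs counted once in eager mode, divided by the
    measured per-batch latency of the timed backend."""
    return model_flops / latency_sec / 1e12

# Total FLOP count for one ResNet-50 train step, from the table above.
model_flops = 1039.278e9

# Eager mode: 90.369 ms CPU wall time per batch -> ~11.5004 TFLOPs/s
eager = achieved_tflops_per_sec(model_flops, 90.369e-3)

# Inductor mode: same eager-counted FLOPs, 59.211 ms per batch -> ~17.5521 TFLOPs/s
inductor = achieved_tflops_per_sec(model_flops, 59.211e-3)

print(f"eager: {eager:.4f} TFLOPs/s, inductor: {inductor:.4f} TFLOPs/s")
```

Because the FLOP count comes from eager mode in both runs, the inductor speedup shows up purely through the lower latency, not through a re-count of compiled-graph ops.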