Always obtain `model_flops` from eager mode (#2390)
Summary:
The FLOPs counter does not support running in inductor mode, so we first obtain the model FLOPs from eager mode, then use the latency measured in inductor mode to calculate the final FLOPS rate.
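A minimal sketch of the resulting calculation, assuming the per-batch FLOP count and latencies from the test logs below (the helper name is hypothetical, not part of the benchmark code):

```python
def tflops_per_second(total_flops: float, latency_seconds: float) -> float:
    """Hypothetical helper: per-batch model FLOPs (counted once in eager
    mode) divided by the per-batch latency measured in the target mode."""
    return total_flops / latency_seconds / 1e12

# Eager mode: FLOPs and latency both come from the eager run.
eager = tflops_per_second(1039.278e9, 90.353e-3)     # ~11.50 TFLOPs/s

# Inductor mode: reuse the eager FLOP count, pair it with inductor latency.
inductor = tflops_per_second(1039.278e9, 59.166e-3)  # ~17.56 TFLOPs/s
```

These values closely match the `Model Flops` lines in the logs below; small differences come from rounding in the printed latencies.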
Fixes https://github.com/pytorch/benchmark/issues/2383
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2390
Test Plan:
```
python run.py resnet50 --metrics model_flops -d cuda -t train
Module FLOP % Total
----------------------------- --------- ---------
ResNet 1039.278B 100.00%
- aten.convolution 523.153B 50.34%
- aten.addmm 0.262B 0.03%
- aten.mm 0.262B 0.03%
- aten.convolution_backward 515.601B 49.61%
ResNet.conv1 22.659B 2.18%
- aten.convolution 15.106B 1.45%
- aten.convolution_backward 7.553B 0.73%
ResNet.fc 0.524B 0.05%
- aten.addmm 0.262B 0.03%
- aten.mm 0.262B 0.03%
ResNet.layer1 170.993B 16.45%
- aten.convolution 85.497B 8.23%
- aten.convolution_backward 85.497B 8.23%
ResNet.layer2 263.067B 25.31%
- aten.convolution 131.533B 12.66%
- aten.convolution_backward 131.533B 12.66%
ResNet.layer3 374.870B 36.07%
- aten.convolution 187.435B 18.04%
- aten.convolution_backward 187.435B 18.04%
ResNet.layer4 207.165B 19.93%
- aten.convolution 103.583B 9.97%
- aten.convolution_backward 103.583B 9.97%
Running train method from resnet50 on cuda in eager mode with input batch size 32 and precision fp32.
GPU Time per batch: 90.353 milliseconds
CPU Wall Time per batch: 90.369 milliseconds
CPU Wall Time: 90.369 milliseconds
Model Flops: 11.5004 TFLOPs per second
```
```
python run.py resnet50 --metrics model_flops -d cuda -t train --inductor
Module FLOP % Total
----------------------------- --------- ---------
ResNet 1039.278B 100.00%
- aten.convolution 523.153B 50.34%
- aten.addmm 0.262B 0.03%
- aten.mm 0.262B 0.03%
- aten.convolution_backward 515.601B 49.61%
ResNet.conv1 22.659B 2.18%
- aten.convolution 15.106B 1.45%
- aten.convolution_backward 7.553B 0.73%
ResNet.fc 0.524B 0.05%
- aten.addmm 0.262B 0.03%
- aten.mm 0.262B 0.03%
ResNet.layer1 170.993B 16.45%
- aten.convolution 85.497B 8.23%
- aten.convolution_backward 85.497B 8.23%
ResNet.layer2 263.067B 25.31%
- aten.convolution 131.533B 12.66%
- aten.convolution_backward 131.533B 12.66%
ResNet.layer3 374.870B 36.07%
- aten.convolution 187.435B 18.04%
- aten.convolution_backward 187.435B 18.04%
ResNet.layer4 207.165B 19.93%
- aten.convolution 103.583B 9.97%
- aten.convolution_backward 103.583B 9.97%
Running train method from resnet50 on cuda in dynamo inductor mode with input batch size 32 and precision fp32.
/home/xz/miniconda3/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:150: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
GPU Time per batch: 59.166 milliseconds
CPU Wall Time per batch: 59.211 milliseconds
CPU Wall Time: 59.211 milliseconds
Model Flops: 17.5521 TFLOPs per second
PT2 Compilation time: 18.833 seconds
```
Reviewed By: FindHao
Differential Revision: D60048189
Pulled By: xuzhao9
fbshipit-source-id: 1cf150fe4f2c5d95709cecbb30179c38246af781