Add ncu report analyzer (#2497)
Summary:
This PR adds an ncu report analyzer that parses the profiled ncu report. It also adds two metrics, `memory_traffic` and `arithmetic_intensity`. To avoid excessive profiling overhead, we profile only the ncu metrics these need.
This PR is part of the [operator benchmarking plan](https://github.com/pytorch/pytorch/issues/136168).
Example command:
```
python run_benchmark.py triton --op gather_gemv --num-inputs 1 --metrics memory_traffic,arithmetic_intensity --csv
```
Example output:
```
0%| | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 508958 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
==PROF== Profiling "index_elementwise_kernel" - 0: 0%....50%....100% - 3 passes
==PROF== Profiling "unrolled_elementwise_kernel" - 1: 0%....50%....100% - 3 passes
==PROF== Profiling "gemv2T_kernel_val" - 2: 0%....50%....100% - 3 passes
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.89s/it]
x_val;test_eager-_ncu_trace_in_task
2048;success
==PROF== Disconnected from process 508958
==WARNING== No source files were imported. Check that the target application was compiled with -lineinfo.
==PROF== Report: /scratch/yhao/tmp/tritonbench/gather_gemv/ncu_traces/test_eager_0/ncu_output.ncu-rep
0%| | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 509121 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
==PROF== Profiling "triton_red_fused_mv_0" - 0: 0%....50%....100% - 3 passes
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.79s/it]
x_val;test_0-_ncu_trace_in_task
2048;success
==PROF== Disconnected from process 509121
==PROF== Report: /scratch/yhao/tmp/tritonbench/gather_gemv/ncu_traces/test_0_0/ncu_output.ncu-rep
0%| | 0/1 [00:00<?, ?it/s]==PROF== Connected to process 509285 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
==PROF== Profiling "triton_red_fused_mv_0" - 0: 0%....50%....100% - 3 passes
==PROF== Connected to process 509433 (/scratch/yhao/miniconda3/envs/pta_gil/bin/python3.10)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.07s/it]
x_val;test_inductor-_ncu_trace_in_task
2048;success
==PROF== Disconnected from process 509285
==PROF== Disconnected from process 509433
==PROF== Report: /scratch/yhao/tmp/tritonbench/gather_gemv/ncu_traces/test_inductor_0/ncu_output.ncu-rep
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.99s/it]
x_val;test_eager-arithmetic_intensity;test_eager-memory_traffic;test_eager-weighted_fp32_arithmetic_intensity;test_0-arithmetic_intensity;test_0-memory_traffic;test_0-weighted_fp32_arithmetic_intensity;test_inductor-arithmetic_intensity;test_inductor-memory_traffic;test_inductor-weighted_fp32_arithmetic_intensity
2048;(0.14937214493924472, 0.0);(29467392.0, 505856.0);0.14937214493924472;(4.364079147640791, 0.0);(4204544.0, 256.0);4.364079147640791;(9.97989888530182, 0.0);(4202752.0, 0.0);9.97989888530182
```
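For downstream use, the final semicolon-delimited table can be loaded into Python values. The following is a minimal sketch, not part of this PR: `parse_metrics` is a hypothetical helper, and the input is a subset of the columns copied from the example output above. The meaning of each tuple element is not stated here, so the sketch keeps tuples as-is.

```python
import ast
import csv
import io

# Header and data row copied from the example output (subset of columns).
RAW = (
    "x_val;test_eager-arithmetic_intensity;test_eager-memory_traffic;"
    "test_eager-weighted_fp32_arithmetic_intensity\n"
    "2048;(0.14937214493924472, 0.0);(29467392.0, 505856.0);0.14937214493924472\n"
)


def parse_metrics(text):
    """Parse the ';'-delimited metrics table into a list of dicts.

    Hypothetical helper: each cell is evaluated as a Python literal,
    so "2048" becomes an int and "(a, b)" becomes a tuple of floats.
    """
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    rows = []
    for values in reader:
        rows.append(
            {name: ast.literal_eval(value) for name, value in zip(header, values)}
        )
    return rows


rows = parse_metrics(RAW)
for name, value in rows[0].items():
    print(f"{name}: {value}")
```

`ast.literal_eval` is used instead of `eval` so that only plain literals (numbers and tuples) are accepted from the report text.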
According to ncu, there can be multiple roofline charts at different granularities, such as single precision, double precision, tensor core, and half precision.
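To illustrate why the granularity matters, here is a minimal sketch of the standard roofline model, in which arithmetic intensity is FLOPs per byte of memory traffic and attainable throughput is capped by either the compute peak or the memory bandwidth. This is not the analyzer's implementation: the function names and the peak values below are hypothetical placeholders, not measurements of any real GPU.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: min(compute peak, intensity * memory bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical ceilings (FLOP/s), one per precision; illustrative only.
PEAK_FLOPS = {"fp32": 20e12, "fp64": 10e12}
# Hypothetical memory bandwidth in bytes/s; illustrative only.
PEAK_BANDWIDTH = 1.5e12

intensity = arithmetic_intensity(flops=8.0e9, bytes_moved=1.0e9)
for precision, peak in PEAK_FLOPS.items():
    bound = attainable_flops(intensity, peak, PEAK_BANDWIDTH)
    limiter = "compute" if bound == peak else "memory"
    print(f"{precision}: {bound:.3e} FLOP/s ({limiter}-bound)")
```

The same kernel can be memory-bound under one ceiling and compute-bound under another, which is why a single intensity value is read against a separate roofline per precision.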
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2497
Reviewed By: xuzhao9
Differential Revision: D64359055
Pulled By: FindHao
fbshipit-source-id: a02a4ebfcac5c5209f4196aac5a8eb4f91b3de87