Add OSS support to the ncu_rep metric (#2387)
Summary:
We need to append `sys.executable` when running NCU in an OSS environment. This is not needed in Meta's internal environment.
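A minimal sketch of the idea, not the code in this PR: `build_ncu_command`, its parameters, and the `is_internal` flag are hypothetical names used only for illustration. It assumes the NCU invocation is assembled as an argument list, with the Python interpreter added only in the OSS case.

```python
import sys


def build_ncu_command(ncu_args, script_args, is_internal):
    """Hypothetical helper: assemble the argument list used to launch NCU.

    In an OSS environment the Python interpreter path is appended so that
    NCU launches the benchmark script through it. The internal case is
    modeled here as a plain boolean flag.
    """
    cmd = ["ncu", *ncu_args]
    if not is_internal:
        # OSS: NCU needs an explicit interpreter to run the script.
        cmd.append(sys.executable)
    cmd.extend(script_args)
    return cmd


if __name__ == "__main__":
    # Example: print the OSS form of the command.
    print(build_ncu_command(["-o", "ncu_output"], ["run_benchmark.py", "triton"], False))
```

With `is_internal=False` the interpreter path sits between the NCU options and the script arguments; with `is_internal=True` it is omitted.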
Pull Request resolved: https://github.com/pytorch/benchmark/pull/2387
Test Plan:
```
TORCH_CUDA_ARCH_LIST=9.0a CUDA_VISIBLE_DEVICES=5 python run_benchmark.py triton --op flash_attention --only flash_v3 --num-inputs 1 --dump-csv --metrics ncu_rep --batch 8 --n-heads 16 --d-head 128
SeqLen flash_v3-ncu_rep
-------- -------------------------------------------------------------------------
128 /tmp/tritonbench/flash_attention/ncu_traces/flash_v3_0/ncu_output.ncu-rep
```
Reviewed By: manman-ren
Differential Revision: D59920615
Pulled By: xuzhao9
fbshipit-source-id: c27b9aef7048dbefcd93a7233df632a8886c71c9