Run a single iteration when collecting NCU traces
Summary: We assume that NCU (NVIDIA Nsight Compute) handles warmup and kernel repetition by itself, so the Tritonbench framework skips its own warmup and repeated runs when running under NCU and executes each benchmark once.
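
A minimal sketch of the intended behavior, not the actual Tritonbench code: the function name `run_benchmark`, its parameters, and the `under_ncu` flag are hypothetical and chosen only for illustration. When the flag is set, the harness drops its own warmup and runs the benchmark once, leaving any repetition to the profiler.

```python
import time
from typing import Callable, List


def run_benchmark(
    fn: Callable[[], None],
    warmup: int = 25,
    rep: int = 100,
    under_ncu: bool = False,  # hypothetical flag, not a real Tritonbench option
) -> List[float]:
    """Run fn and return per-iteration wall-clock timings in seconds."""
    if under_ncu:
        # Assumption from the summary: the profiler handles warmup and
        # repetition, so the harness should launch the workload only once.
        warmup, rep = 0, 1

    # Untimed warmup iterations (skipped entirely under the profiler).
    for _ in range(warmup):
        fn()

    # Timed iterations.
    timings = []
    for _ in range(rep):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings
```

With `under_ncu=True` the workload is invoked exactly once regardless of the `warmup` and `rep` values passed in, so the collected trace covers a single iteration.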
Reviewed By: int3
Differential Revision: D62451609
fbshipit-source-id: d61d8a58500b8009db9d7f93cef730b48b063667