Add an argument to specify warmup iterations (#78124)
Summary: Add an argument to ``torch.cuda.make_graphed_callables`` that specifies the number of warm-up iterations run before graph capture. The default is 3 warm-up iterations. To work with NCCL, 11 warm-up iterations are needed.
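Example: a minimal sketch of how a caller might pick the warm-up count described above. The keyword name ``num_warmup_iters`` is an assumption (the summary does not name the argument), and ``pick_warmup_iters`` and ``graph_module`` are hypothetical helpers written for illustration; they are not part of this change.

```python
# Warm-up counts taken from the summary: 3 by default, 11 when NCCL is involved.
DEFAULT_WARMUP_ITERS = 3
NCCL_WARMUP_ITERS = 11


def pick_warmup_iters(uses_nccl: bool) -> int:
    """Hypothetical helper: return the warm-up count suggested by the summary."""
    return NCCL_WARMUP_ITERS if uses_nccl else DEFAULT_WARMUP_ITERS


def graph_module(module, sample_input, uses_nccl=False):
    """Hypothetical wrapper: graph a CUDA module with the chosen warm-up count.

    Requires PyTorch with a CUDA device; `module` and `sample_input` must
    already live on that device.
    """
    import torch  # imported lazily so pick_warmup_iters works without PyTorch

    return torch.cuda.make_graphed_callables(
        module,
        (sample_input,),
        # Assumed keyword name for the argument added by this change.
        num_warmup_iters=pick_warmup_iters(uses_nccl),
    )
```

On a CUDA machine this could be called as ``graph_module(torch.nn.Linear(8, 8).cuda(), torch.randn(4, 8, device="cuda"))``, passing ``uses_nccl=True`` when the module communicates through NCCL.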
Differential Revision: D36606758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78124
Approved by: https://github.com/jianyuh