reduce the number of autotuning iterations, don't autotune simple til… (#88386)
…ed copies
Partially fixes https://github.com/pytorch/torchdynamo/issues/1807 and reduces compile time for me from 360 s to 90 s.
Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386
Approved by: https://github.com/jansel
Author
Natalia Gimelshein