use faster cache flush in triton benchmarking (#88557)
Speeds up autotuning a little bit more (about 90s -> 75s for coat_lite_mini)
@bertmaher, I've put in a workaround so that internal doesn't break, but it can be removed once triton is updated internally.
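For illustration only, a workaround of this kind could look roughly like the sketch below: pass the new option only when the installed benchmarking helper accepts it, so older versions keep working. The keyword name `fast_flush` and the helpers `old_do_bench` / `new_do_bench` are assumptions made up for this example, not necessarily what the PR or triton uses.

```python
import inspect


def call_with_optional_kwargs(fn, *args, optional=None, **kwargs):
    """Call fn, forwarding each optional kwarg only if fn's signature accepts it."""
    optional = optional or {}
    params = inspect.signature(fn).parameters
    # A **kwargs parameter means fn accepts any keyword argument.
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    supported = {
        name: value
        for name, value in optional.items()
        if accepts_var_kw or name in params
    }
    return fn(*args, **kwargs, **supported)


# Hypothetical stand-ins for an older and a newer benchmarking helper.
def old_do_bench(fn, rep=3):
    return [fn() for _ in range(rep)]


def new_do_bench(fn, rep=3, fast_flush=False):
    # fast_flush is an assumed flag name selecting the faster cache flush.
    return [fn() for _ in range(rep)], fast_flush


# The same call site works against both versions.
old_result = call_with_optional_kwargs(
    old_do_bench, lambda: 1, optional={"fast_flush": True}
)
new_result = call_with_optional_kwargs(
    new_do_bench, lambda: 1, optional={"fast_flush": True}
)
```

Once every environment has the newer helper, the wrapper can be dropped and the keyword passed directly.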
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88557
Approved by: https://github.com/anijain2305
Author: Natalia Gimelshein