Reverting launch bounds change in topK that induced a perf regression (#63431)
Summary:
[topkwsyncs.zip](https://github.com/pytorch/pytorch/files/7003077/topkwsyncs.zip)
Running the attached script on the NVIDIA 21.08 vs. 21.07 containers, we see the following perf drops:
topk(input=(dtype=torch.float16,shape=[60, 201600]), k=2000, dim=1, sorted=True) - 0.63
topk(input=(dtype=torch.float32,shape=[120000]), k=12000, dim=0, sorted=False) - 0.55
topk(input=(dtype=torch.float16,shape=[5, 201600]), k=2000, dim=1, sorted=True) - 0.55
topk(input=(dtype=torch.float32,shape=[1, 10000]), k=1000, dim=1, sorted=False) - 0.33
The relative perf drop is reported as (21.08_time - 21.07_time) / 21.07_time, so 0.63 means the call became 63% slower in 21.08.
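
For reference, the comparison can be reproduced along these lines (a minimal sketch, not the attached topkwsyncs.zip script; the shapes come from the table above, while the timing harness and iteration counts are assumptions):

```python
import torch

def time_topk(shape, dtype, k, dim, sorted, iters=100):
    x = torch.randn(shape, device="cuda", dtype=dtype)
    # Warm up so one-time CUDA initialization does not skew the measurement.
    for _ in range(10):
        torch.topk(x, k, dim=dim, sorted=sorted)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.topk(x, k, dim=dim, sorted=sorted)
    end.record()
    torch.cuda.synchronize()  # wait for the end event before reading it
    return start.elapsed_time(end) / iters  # milliseconds per call

# Time each case from the table; run once per container and compare.
print(time_topk([60, 201600], torch.float16, k=2000, dim=1, sorted=True))
print(time_topk([120000], torch.float32, k=12000, dim=0, sorted=False))
print(time_topk([5, 201600], torch.float16, k=2000, dim=1, sorted=True))
print(time_topk([1, 10000], torch.float32, k=1000, dim=1, sorted=False))
```

Running the same harness in both containers and plugging the per-call times into the formula above yields the relative drops listed.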
I narrowed the source of the regression down to https://github.com/pytorch/pytorch/pull/60314,
which reduced the kernel's launch bounds (the maximum threads per block the kernel is compiled for) from 1024 to 512.
The original evidence provided to justify lowering 1024 to 512 showed no regression because the benchmark's input shapes were much smaller than those of the tensors where I observe the regression. I suggest reverting to 1024: 512 brought no considerable perf improvement for small inputs while causing a major regression for large tensors.
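
For what it's worth, one way to confirm the time is going into the topK kernel itself rather than surrounding work is a profiler comparison across the two containers (a sketch using torch.profiler; this is not how the bisection above was done, and kernel names will vary by version):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn([60, 201600], device="cuda", dtype=torch.float16)
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        torch.topk(x, 2000, dim=1, sorted=True)
    torch.cuda.synchronize()  # flush outstanding kernels before the trace closes
# The top CUDA entries should be dominated by the topK/radix-select kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))
```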
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63431
Reviewed By: mruberry
Differential Revision: D30384087
Pulled By: ngimel
fbshipit-source-id: 11eecbba82a069b1d4579d674c3f644ab8060ad2