optimize topk for greedysearch (#14271)
Optimize top 1 computation in greedysearch.
For a vocabulary size of 50k on A100:
- batch size 1: from 220us to 10.4us.
- batch size 4: from 230us to 11.5us.
For example, generating 50 tokens saves about 50 * 0.2ms = 10ms.
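
Illustration only, not the kernel from this change: a minimal Python sketch of why top 1 can be special-cased. With k = 1, top-k selection reduces to a single-pass argmax per batch row, so no sorting or partial selection is needed. The function names (`top1`, `greedy_next_tokens`, `estimated_saving_ms`) are hypothetical and do not exist in the codebase; the last one just reproduces the savings estimate from the timings above.

```python
def top1(logits):
    """Single-pass argmax: return (index, value) of the first maximum."""
    best_index, best_value = 0, logits[0]
    for i in range(1, len(logits)):
        if logits[i] > best_value:
            best_index, best_value = i, logits[i]
    return best_index, best_value


def greedy_next_tokens(batch_logits):
    """Greedy search step: pick the highest-scoring token id for each batch row."""
    return [top1(row)[0] for row in batch_logits]


def estimated_saving_ms(steps, before_us, after_us):
    """Total time saved over `steps` decoding steps, in milliseconds."""
    return steps * (before_us - after_us) / 1000.0


if __name__ == "__main__":
    print(greedy_next_tokens([[0.1, 2.0, -1.0], [3.0, 3.0, 1.0]]))
    print(estimated_saving_ms(50, 220.0, 10.4))
```

With the batch size 1 timings (220us to 10.4us), the unrounded estimate for 50 steps is about 10.5ms, consistent with the rounded 10ms figure.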