Optimize Softmax Kernel (#3112)
* Simplify kernel
* Coalesce memory accesses, attempt 1. Logits diverge.
* Logits fix?
* Sync after every global memory access
* Template on iterations. Down to 8.3% CUDA time for 8k tokens
* Up to 64 iterations
* Add alibi/mask check
* fp32
* Revert builder.py
* naming. precommit
* Revert "naming. precommit"
  This reverts commit 150eb7d96b6084190265b440739317216992bd82.
* naming. spacing
* Spacing. simplify checks
* remove bsyncs
* missed bsyncs
* precommit