Optimize Softmax Kernel (#3112)

Commit

2 years ago

Optimize Softmax Kernel (#3112)

* Simplify kernel
* Coalesce memory attempt 1. Logits divergence.
* Logits fix?
* sync after every global mem access
* template on iterations. Down to 8.3% cuda time for 8k tokens
* Up to 64 iterations
* Add alibi/mask check
* fp32
* Revert builder.py
* naming. precommit
* Revert "naming. precommit"
  This reverts commit 150eb7d96b6084190265b440739317216992bd82.
* naming. spacing
* Spacing. simplify checks
* remove bsyncs
* missed bsyncs
* precommit
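For readers unfamiliar with the "template on iterations" and "coalesce memory" items, the following is a minimal CPU-side sketch of the general idea, not the code from this commit. The function name `softmax_row`, the template parameters `kIterations` and `kLanes`, and the three-pass layout are all assumptions made for illustration. Each emulated lane touches element `it * kLanes + lane`, so neighbouring lanes read neighbouring addresses (the access pattern a GPU can coalesce), and the per-lane trip count is a compile-time constant that the compiler can unroll. Accumulation is done in fp32.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical illustration only; not taken from the DeepSpeed sources.
// kIterations: compile-time number of elements each lane processes.
// kLanes: stand-in for the number of cooperating GPU threads.
template <int kIterations, int kLanes = 32>
std::vector<float> softmax_row(const std::vector<float>& logits) {
    const std::size_t n = logits.size();
    // The caller must pick kIterations so that all elements are covered.
    assert(n <= static_cast<std::size_t>(kIterations) * kLanes);

    std::vector<float> out(n);
    if (n == 0) return out;

    // Pass 1: row maximum, for numerical stability.
    float row_max = -std::numeric_limits<float>::infinity();
    for (int lane = 0; lane < kLanes; ++lane) {
        for (int it = 0; it < kIterations; ++it) {  // fixed trip count
            const std::size_t idx = static_cast<std::size_t>(it) * kLanes + lane;
            if (idx < n) row_max = std::max(row_max, logits[idx]);
        }
    }

    // Pass 2: exponentiate shifted logits and accumulate the sum in fp32.
    float row_sum = 0.0f;
    for (int lane = 0; lane < kLanes; ++lane) {
        for (int it = 0; it < kIterations; ++it) {
            const std::size_t idx = static_cast<std::size_t>(it) * kLanes + lane;
            if (idx < n) {
                out[idx] = std::exp(logits[idx] - row_max);
                row_sum += out[idx];
            }
        }
    }

    // Pass 3: normalize so the row sums to one.
    for (std::size_t i = 0; i < n; ++i) out[i] /= row_sum;
    return out;
}
```

In a real kernel the outer lane loop would be replaced by the thread index and the max and sum would be combined with warp or block reductions; a caller would dispatch to an instantiation such as `softmax_row<2, 4>` based on the row length.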

References

#3112 - Optimize Softmax Kernel

Author

molly-smith

Parents

f2c9a827

DeepSpeed e73de8ce - Optimize Softmax Kernel (#3112)
