DeepSpeed
e73de8ce - Optimize Softmax Kernel (#3112)

Commit
2 years ago
Optimize Softmax Kernel (#3112)

* Simplify kernel
* Coalesce memory attempt 1. Logits divergence.
* Logits fix?
* sync after every global mem access
* template on iterations. Down to 8.3% cuda time for 8k tokens
* Up to 64 iterations
* Add alibi/mask check
* fp32
* Revert builder.py
* naming. precommit
* Revert "naming. precommit"

  This reverts commit 150eb7d96b6084190265b440739317216992bd82.
* naming. spacing
* Spacing. simplify checks
* remove bsyncs
* missed bsyncs
* precommit