[AI Accelerators] softmax kernel for Nested Tensor (CPU) (#79756)
Summary: Implement a faster softmax kernel for Nested Tensor on CPU.
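For illustration only, here is a minimal pure-Python sketch of the idea the benchmark labels below point at: softmax over variable-length rows can be computed row by row, without padding to a common length and masking. This is not the kernel added in this PR. The function names (`softmax_row`, `nested_softmax`, `padded_masked_softmax`) are hypothetical, and representing a nested tensor as a list of lists is an assumption made for the example. Rows are assumed to be non-empty.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one non-empty list of floats."""
    m = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def nested_softmax(sequences):
    """Mask-free variant: softmax over each variable-length row as stored."""
    return [softmax_row(seq) for seq in sequences]


def padded_masked_softmax(sequences, max_len):
    """Padded variant: pad each row to max_len with -inf (acting as the mask),
    run softmax over the full padded row, then drop the padded positions."""
    out = []
    for seq in sequences:
        padded = seq + [float("-inf")] * (max_len - len(seq))
        probs = softmax_row(padded)  # exp(-inf) == 0.0, so padding adds nothing
        out.append(probs[: len(seq)])
    return out


# Hypothetical example input: three rows of different lengths.
seqs = [[1.0, 2.0, 3.0], [0.5], [2.0, -1.0]]
mask_free = nested_softmax(seqs)
masked = padded_masked_softmax(seqs, max_len=4)
```

Both variants produce the same probabilities, but the padded variant does work proportional to `max_len` for every row, while the mask-free variant only touches the elements that exist.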
Test Plan:
Benchmark results:
On CPU (command: buck run mode/opt -c fbcode.platform=platform009 //pytext/fb/tools:benchmark_transformers -- transformer --large --use-trt-kernel False --batch-size 16 --avg-sequence-length 64 --max-sequence-length 256 --iters 10 --use-real-data-distribution --module native --use-nt True --use-cpu True):
With mask (previous impl):
NT: 4573.14 ms/iter, 0.14 TFLOP/s, Speedup: 2.33x
Without mask (this change):
NT: 3530.55 ms/iter, 0.18 TFLOP/s, Speedup: 1.51x
Reviewed By: mikekgfb
Differential Revision: D35679352
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79756
Approved by: https://github.com/erichan1