[te] Benchmark vml-based logit (#51771)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771
This benchmarks an NNC implementation of logit based on VML's log
implementation.
It's a modest improvement over the sleef algorithm, but it is a bit slower
than aten at the larger sizes (8192 and 32768). I'm not sure why: you'd expect
a fused logit kernel to beat doing clamp/sub/div followed by log. And yet...
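For reference, a minimal sketch of the two shapes being compared: an unfused
clamp/sub/div/log sequence with one pass over memory per step, and a fused
single-pass loop. This is plain C++ for illustration only, not the aten or NNC
code; the function names and the eps clamp parameter are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused (hypothetical): one full pass over memory per step.
std::vector<float> logit_unfused(const std::vector<float>& x, float eps) {
  assert(eps >= 0.0f && eps < 0.5f);
  const std::size_t n = x.size();
  std::vector<float> clamped(n), one_minus(n), ratio(n), out(n);
  for (std::size_t i = 0; i < n; ++i)  // clamp to [eps, 1 - eps]
    clamped[i] = std::min(std::max(x[i], eps), 1.0f - eps);
  for (std::size_t i = 0; i < n; ++i)  // sub
    one_minus[i] = 1.0f - clamped[i];
  for (std::size_t i = 0; i < n; ++i)  // div
    ratio[i] = clamped[i] / one_minus[i];
  for (std::size_t i = 0; i < n; ++i)  // log
    out[i] = std::log(ratio[i]);
  return out;
}

// Fused (hypothetical): same arithmetic in a single pass, no temporaries.
std::vector<float> logit_fused(const std::vector<float>& x, float eps) {
  assert(eps >= 0.0f && eps < 0.5f);
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float c = std::min(std::max(x[i], eps), 1.0f - eps);
    out[i] = std::log(c / (1.0f - c));
  }
  return out;
}
```

The fused form reads and writes each element once, which is why it would be
expected to win once the working set no longer fits in cache.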
Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2
machine; I suspect that it's needed to give the scheduler enough freedom to
fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.
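As a rough picture of what "vectorize by 16" means for the loop structure,
here is a plain C++ sketch that splits the loop into fixed 16-element blocks
plus a scalar tail. It does not use the NNC API; kWidth, logit_one and
logit_blocked are hypothetical names, and whether a compiler actually emits
two independent 8-lane AVX2 chains per block depends on the compiler and flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumed block width: two 8-lane AVX2 vectors' worth of floats.
constexpr std::size_t kWidth = 16;

// Scalar logit with an assumed eps clamp.
inline float logit_one(float x, float eps) {
  const float c = std::min(std::max(x, eps), 1.0f - eps);
  return std::log(c / (1.0f - c));
}

void logit_blocked(const float* x, float* out, std::size_t n, float eps) {
  assert((x != nullptr && out != nullptr) || n == 0);
  const std::size_t main_end = n - n % kWidth;
  for (std::size_t i = 0; i < main_end; i += kWidth) {
    // Fixed trip count of 16: the iterations are independent, so the
    // hardware has two 8-lane chains to overlap while a divide is in flight.
    for (std::size_t j = 0; j < kWidth; ++j) {
      out[i + j] = logit_one(x[i + j], eps);
    }
  }
  // Scalar tail for the remaining n % 16 elements.
  for (std::size_t i = main_end; i < n; ++i) {
    out[i] = logit_one(x[i], eps);
  }
}
```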
ghstack-source-id: 121392349
Test Plan:
```
-----------------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64          483 ns          483 ns      1452336 logit/s=132.469M/s
logit_nnc_sleef/512        3019 ns         3019 ns       228059 logit/s=169.577M/s
logit_nnc_sleef/8192      71427 ns        71424 ns         9662 logit/s=114.695M/s
logit_nnc_sleef/32768    307062 ns       306722 ns         2406 logit/s=106.833M/s
logit_nnc_fast/64           147 ns          147 ns      4408910 logit/s=434.908M/s
logit_nnc_fast/512          781 ns          781 ns       881230 logit/s=655.53M/s
logit_nnc_fast/8192       12519 ns        12518 ns        55626 logit/s=654.421M/s
logit_nnc_fast/32768      50530 ns        50526 ns        10000 logit/s=648.536M/s
logit_nnc_vml/64            125 ns          125 ns      5551460 logit/s=511.603M/s
logit_nnc_vml/512           733 ns          733 ns       938444 logit/s=698.955M/s
logit_nnc_vml/8192        11282 ns        11280 ns        61610 logit/s=726.23M/s
logit_nnc_vml/32768       45051 ns        44991 ns        15473 logit/s=728.325M/s
logit_aten/64               450 ns          449 ns      1599269 logit/s=142.429M/s
logit_aten/512             1055 ns         1054 ns       665538 logit/s=485.595M/s
logit_aten/8192           10865 ns        10864 ns        64152 logit/s=754.032M/s
logit_aten/32768          42106 ns        42103 ns        16477 logit/s=778.287M/s
logit_caffe2/64             233 ns          233 ns      2952127 logit/s=274.761M/s
logit_caffe2/512           1795 ns         1795 ns       393354 logit/s=285.177M/s
logit_caffe2/8192         29924 ns        29923 ns        23225 logit/s=273.77M/s
logit_caffe2/32768       123899 ns       123893 ns         5642 logit/s=264.487M/s
```
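To read the table: the logit/s counter appears to be elements processed
divided by CPU time (inferred from the numbers, not stated anywhere above).
A small sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Millions of elements per second, given an element count and CPU time in ns.
// elements / ns is billions per second, so scale by 1e3 to get millions.
inline double throughput_m_per_s(std::size_t elements, double cpu_ns) {
  assert(cpu_ns > 0.0);
  return static_cast<double>(elements) / cpu_ns * 1e3;
}

// Ratio of two wall-clock times; values above 1.0 mean `a` is slower than `b`.
inline double time_ratio(double a_ns, double b_ns) {
  assert(b_ns > 0.0);
  return a_ns / b_ns;
}
```

For example, throughput_m_per_s(32768, 44991) comes out at roughly 728M/s,
matching the logit_nnc_vml/32768 row, and time_ratio(45051, 42106) is about
1.07, i.e. vml is roughly 7% slower than aten at that size.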
Reviewed By: bwasti
Differential Revision: D26272325
fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688