pytorch
602434bc - [te] Benchmark vml-based logit (#51771)

Commit
3 years ago
[te] Benchmark vml-based logit (#51771)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51771

This benchmarks an NNC implementation of logit based on VML's log implementation.

It's a modest improvement over the sleef algorithm, but seems to be a bit slower than aten (at larger sizes), and I'm not totally sure why, since you'd think a fused logit kernel would be better than doing clamp/sub/div, followed by log. And yet...

Note that it's important to vectorize this kernel by 16, even on an 8-wide AVX2 machine; I suspect that it's needed to give the scheduler enough freedom to fill up both FMA pipes to avoid stalling on fpdiv or (maybe) memory.

ghstack-source-id: 121392349

Test Plan:
```
-----------------------------------------------------------------------------
Benchmark                     Time            CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
logit_nnc_sleef/64          483 ns         483 ns      1452336 logit/s=132.469M/s
logit_nnc_sleef/512        3019 ns        3019 ns       228059 logit/s=169.577M/s
logit_nnc_sleef/8192      71427 ns       71424 ns         9662 logit/s=114.695M/s
logit_nnc_sleef/32768    307062 ns      306722 ns         2406 logit/s=106.833M/s
logit_nnc_fast/64           147 ns         147 ns      4408910 logit/s=434.908M/s
logit_nnc_fast/512          781 ns         781 ns       881230 logit/s=655.53M/s
logit_nnc_fast/8192       12519 ns       12518 ns        55626 logit/s=654.421M/s
logit_nnc_fast/32768      50530 ns       50526 ns        10000 logit/s=648.536M/s
logit_nnc_vml/64            125 ns         125 ns      5551460 logit/s=511.603M/s
logit_nnc_vml/512           733 ns         733 ns       938444 logit/s=698.955M/s
logit_nnc_vml/8192        11282 ns       11280 ns        61610 logit/s=726.23M/s
logit_nnc_vml/32768       45051 ns       44991 ns        15473 logit/s=728.325M/s
logit_aten/64               450 ns         449 ns      1599269 logit/s=142.429M/s
logit_aten/512             1055 ns        1054 ns       665538 logit/s=485.595M/s
logit_aten/8192           10865 ns       10864 ns        64152 logit/s=754.032M/s
logit_aten/32768          42106 ns       42103 ns        16477 logit/s=778.287M/s
logit_caffe2/64             233 ns         233 ns      2952127 logit/s=274.761M/s
logit_caffe2/512           1795 ns        1795 ns       393354 logit/s=285.177M/s
logit_caffe2/8192         29924 ns       29923 ns        23225 logit/s=273.77M/s
logit_caffe2/32768       123899 ns      123893 ns         5642 logit/s=264.487M/s
```

Reviewed By: bwasti

Differential Revision: D26272325

fbshipit-source-id: b9771a96e0150685506dbc625e7894e81c93a688
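
For readers unfamiliar with the distinction the summary draws, here is a minimal pure-Python sketch of "clamp/sub/div, followed by log" versus a single fused pass. It is not the NNC, VML, or aten kernel from this commit. The function names and the `eps` clamp value are hypothetical, chosen only for illustration, and a scalar Python loop says nothing about vectorized performance.

```python
import math

EPS = 1e-6  # hypothetical clamp bound; the commit does not state the value used


def logit_unfused(xs, eps=EPS):
    """Compute logit as four separate passes, each producing a full temporary."""
    clamped = [min(max(x, eps), 1.0 - eps) for x in xs]   # clamp
    one_minus = [1.0 - c for c in clamped]                # sub
    ratio = [c / m for c, m in zip(clamped, one_minus)]   # div
    return [math.log(r) for r in ratio]                   # log


def logit_fused(xs, eps=EPS):
    """Compute logit in one pass, with no intermediate lists."""
    out = []
    for x in xs:
        c = min(max(x, eps), 1.0 - eps)
        out.append(math.log(c / (1.0 - c)))
    return out


if __name__ == "__main__":
    sample = [0.0, 0.1, 0.5, 0.9, 1.0]
    print(logit_unfused(sample))
    print(logit_fused(sample))
```

Both functions apply the same floating-point operations in the same order, so they return identical values. The fused form only removes the intermediate buffers, which is why one would expect it to be at least as fast as the unfused sequence.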