Cummax and Cummin doc update and performance benchmark (#32537)
Summary:
[CPU] Benchmark results for cummax, cummin:
In [1]: import torch
In [2]: x=torch.randn(5,6,7).cuda()
In [3]: %timeit x.cummax(0)
134 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: %timeit x.max(0)
114 µs ± 560 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit x.cummax(1)
134 µs ± 760 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: %timeit x.max(1)
118 µs ± 514 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit x.cumsum(0)
97.1 µs ± 6.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [8]: %timeit x.cumprod(0)
83.6 µs ± 689 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [9]: %timeit x.cumprod(1)
86.3 µs ± 528 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: y=torch.randn(5,6,7)
In [11]: %timeit y.cummax(0)
148 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit y.max(0)
111 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [13]: %timeit y.cumsum(0)
54.8 µs ± 311 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [14]: %timeit y.cumprod(0)
56.2 µs ± 836 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32537
Differential Revision: D19951171
Pulled By: anjali411
fbshipit-source-id: cf972c550189473e9ce62e24ac7dd34b9373fef9