[takeover] BTRS algorithm for fast/efficient binomial sampling (#36858)
Summary:
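This takes over the original PR, https://github.com/pytorch/pytorch/pull/31278, which adds the BTRS algorithm for binomial sampling.
In the GPU benchmark below, sampling time per call drops from 46.3 ms on master to 737 µs with this PR (roughly 63x faster).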
CC: ezyang jamestwebber fritzo zasdfgbnm
---
<!-- # This PR - CPU
In [1]: import torch; import torch.distributions as dist
In [2]: counts = torch.randint(10, 1000, [1000,1000])
...: p = 0.5 * torch.ones(1000, 1000)
In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
94.8 ms ± 911 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
-->
```
# This PR - GPU
In [1]: import torch; import torch.distributions as dist
In [2]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()
In [3]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
737 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# master (commit: 806f22b167c74897cf67c0828b528fa3e4e6d6de) - GPU
In [5]: counts = torch.randint(10, 1000, [1000,1000]).cuda(); p = 0.5 * torch.ones(1000, 1000).cuda()
In [6]: %timeit dist.binomial.Binomial(total_count=counts, probs=p).sample()
46.3 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
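The IPython session above can also be reproduced as a standalone script. The following is a minimal sketch, not part of the PR itself: the helper names (`make_inputs`, `sample_binomial`, `time_sampling`) are hypothetical, and it assumes a PyTorch build where `torch.distributions.Binomial` accepts integer `total_count` tensors as in the transcript. It synchronizes the GPU before stopping the clock, since CUDA kernels launch asynchronously, and falls back to CPU when no GPU is available.

```python
import timeit

import torch
import torch.distributions as dist


def make_inputs(shape=(1000, 1000), device="cpu"):
    # Same inputs as the benchmark: integer counts in [10, 1000) and p = 0.5.
    counts = torch.randint(10, 1000, shape, device=device)
    probs = 0.5 * torch.ones(shape, device=device)
    return counts, probs


def sample_binomial(counts, probs):
    # Draw one binomial sample per (count, prob) pair.
    return dist.Binomial(total_count=counts, probs=probs).sample()


def time_sampling(counts, probs, repeats=10):
    # Returns the mean wall-clock seconds per call (hypothetical helper).
    def run():
        sample_binomial(counts, probs)
        if counts.is_cuda:
            # Wait for queued CUDA work so the timing covers the kernel itself.
            torch.cuda.synchronize()

    run()  # warm-up call, excluded from the measurement
    return timeit.timeit(run, number=repeats) / repeats


device = "cuda" if torch.cuda.is_available() else "cpu"
counts, probs = make_inputs(device=device)
samples = sample_binomial(counts, probs)
seconds_per_call = time_sampling(counts, probs, repeats=3)
print(f"{device}: {seconds_per_call * 1e3:.3f} ms per call")
```

Absolute timings depend on hardware and PyTorch version, so the numbers printed by this sketch are not expected to match the ones reported above.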
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36858
Differential Revision: D21178367
Pulled By: ezyang
fbshipit-source-id: 7e7d6f463e35b07156d69bd7452040b2f9c2eb7a