Optimized bincount for the CPU by removing extra size() calls (#35822)
Summary:
By removing calls to `size` that were effectively no-ops, I made `bincount_cpu` run around 6 times faster on my machine. EDIT: I'm running Windows 10 and suspect this may be a Windows-specific bug.
Histogramming 1e7 samples into 1e5 bins (best of 20 repeats, 10 runs each):
Before: 3.201189
After: 0.466188
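
The timing script itself is not included above. As a rough sketch of how a "best of 20 with 10 runs each" measurement could be taken, assuming Python's standard `timeit` module was used: the `bincount` function below is a hypothetical pure-Python stand-in for the real operator, `best_of` is an illustrative helper name, and the sizes are reduced so the sketch finishes quickly.

```python
import random
import timeit


def bincount(samples, n_bins):
    """Pure-Python stand-in for the function being timed (hypothetical)."""
    counts = [0] * n_bins
    for s in samples:
        counts[s] += 1
    return counts


def best_of(fn, repeat=20, number=10):
    """Return the fastest total time (seconds) over `repeat` trials of `number` calls each."""
    return min(timeit.repeat(fn, repeat=repeat, number=number))


# Sizes reduced from 1e7 samples / 1e5 bins so the sketch runs quickly.
N_SAMPLES, N_BINS = 10_000, 100
random.seed(0)
samples = [random.randrange(N_BINS) for _ in range(N_SAMPLES)]

elapsed = best_of(lambda: bincount(samples, N_BINS))
print(f"best of 20 x 10 runs: {elapsed:.6f}")
```

Taking the minimum over the repeats, rather than the mean, reduces the influence of background load on the reported figure.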
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35822
Differential Revision: D20919885
Pulled By: ezyang
fbshipit-source-id: 1657056d69a02f1e61434f4cc8fa800f8d4e1fe8