[C2] Native GPU implementation for bucketize (#33529)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33529
The current version goes through a GPU -> CPU -> GPU copy and is slow: ~19 ms
for 1M elements with 20 possible buckets, based on the benchmark.
The new version takes ~0.2 ms on the same benchmark.
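
For illustration only, here is a minimal CPU-side sketch of the per-element lookup that a native kernel can run independently for each input value, which is what removes the need for the round trip through host memory. It is not the code from this PR: the names `bucket_index` and `bucketize` are hypothetical, and the right-closed bucket convention (boundaries[i-1] < x <= boundaries[i]) is an assumption. The sketch also assumes `boundaries` is sorted in ascending order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the PR's code): returns the index i such that
// boundaries[i-1] < x <= boundaries[i]. Values at or below the first boundary
// map to 0, values above the last boundary map to n.
// Written as a plain binary search so the same body could run per GPU thread.
inline int32_t bucket_index(float x, const float* boundaries, int32_t n) {
  int32_t lo = 0;
  int32_t hi = n;
  while (lo < hi) {
    const int32_t mid = lo + (hi - lo) / 2;
    if (boundaries[mid] < x) {
      lo = mid + 1;  // x lies strictly above this boundary
    } else {
      hi = mid;      // x is at or below this boundary
    }
  }
  return lo;
}

// Stand-in for the kernel launch: each loop iteration is independent,
// so on a GPU one thread could handle one element.
std::vector<int32_t> bucketize(const std::vector<float>& data,
                               const std::vector<float>& boundaries) {
  std::vector<int32_t> out(data.size());
  const int32_t n = static_cast<int32_t>(boundaries.size());
  for (std::size_t i = 0; i < data.size(); ++i) {
    out[i] = bucket_index(data[i], boundaries.data(), n);
  }
  return out;
}
```

With 20 boundaries, each element needs at most about five comparisons, so the per-element work is small and there is no dependency between elements.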
Test Plan: benchmark + unit-test
Reviewed By: chocjy
Differential Revision: D19969518
fbshipit-source-id: 51889bc9a232b6d45d9533e53b7b7f4531da481f