Avoid index_put_ overhead in histogram kernel's inner loop (#67815)
Summary:
**TLDR**: Makes torch.histc run roughly 400x faster on large CPU inputs. Should fix [a broken test on internal CI](https://www.internalfb.com/intern/test/281475013640093/).
HistogramKernel currently calls torch.Tensor.index_put_ once for each element of its input tensor. Obtaining a data pointer and manipulating it directly avoids the considerable dispatch overhead of calling index_put_ per element. Behavior is unchanged because the tensor being operated on is known to be contiguous and in CPU memory, so direct pointer access is safe.
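As a rough illustration of the pattern only (the real kernel is C++), the sketch below contrasts a per-element call into a general-purpose update function with a direct write into a contiguous buffer. Every name here is hypothetical: `generic_index_put` is a stand-in for the fixed per-call cost of a dispatched op, not PyTorch's index_put_, and the equal-width binning is a simplified assumption about what the inner loop computes.

```python
from array import array


def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] to an equal-width bin; hi lands in the last bin."""
    pos = int((value - lo) / (hi - lo) * bins)
    return min(pos, bins - 1)


def generic_index_put(buffer, index, amount):
    """Hypothetical stand-in for a general-purpose op. It re-validates its
    arguments on every call, mimicking the fixed cost that dispatch adds."""
    if not isinstance(buffer, array) or buffer.typecode != "q":
        raise TypeError("expected an int64 array")
    if not 0 <= index < len(buffer):
        raise IndexError("bin index out of range")
    buffer[index] += amount


def histogram_via_generic_op(values, lo, hi, bins):
    """Slow pattern: one general-purpose call per input element."""
    counts = array("q", [0]) * bins
    for v in values:
        if lo <= v <= hi:
            generic_index_put(counts, bin_index(v, lo, hi, bins), 1)
    return counts


def histogram_via_direct_write(values, lo, hi, bins):
    """Fast pattern: the buffer is known to be contiguous and in memory,
    so each element is counted with a plain indexed write."""
    counts = array("q", [0]) * bins
    for v in values:
        if lo <= v <= hi:
            counts[bin_index(v, lo, hi, bins)] += 1
    return counts
```

Both functions return identical counts. The only difference is how much work surrounds each increment, which is the cost this change removes from the inner loop.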
Fixes the performance regression introduced in https://github.com/pytorch/pytorch/pull/65318.
Benchmark: time taken to compute torch.histc on a CPU tensor with 10,000,000 elements:
1. Before https://github.com/pytorch/pytorch/pull/65318: **0.003s**
2. After https://github.com/pytorch/pytorch/pull/65318: **2.154s**
3. After this change: **0.005s**
Benchmark code:
```
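import torch as t
from timeit import default_timer as timer

# 10,000,000 distinct values; all are below 2**24, so float32 holds them exactly.
x = t.randperm(10000000, dtype=t.float32)

# Time a single torch.histc call with the default bin count.
start = timer()
t.histc(x)
end = timer()
print(end - start)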
```
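For reference, the ratios implied by the three timings listed above can be worked out as follows. The numbers are copied from that list. They appear to come from single runs and are rounded to the millisecond, so the ratios should be read as approximate.

```python
# Timings in seconds, copied from the benchmark list in this summary.
before_regression = 0.003
after_regression = 2.154
after_fix = 0.005

slowdown = after_regression / before_regression  # cost of the regression
speedup = after_regression / after_fix           # gain from this change
remaining_gap = after_fix / before_regression    # this change vs. the original baseline

print(f"regression slowdown: {slowdown:.0f}x")
print(f"speedup from this change: {speedup:.0f}x")
print(f"remaining gap to original baseline: {remaining_gap:.1f}x")
```

The speedup comes out at about 431x, consistent with the "400x" figure in the TLDR. The remaining gap of under 2x relative to the original baseline is within what millisecond rounding alone could explain.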
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67815
Reviewed By: anjali411
Differential Revision: D32357663
Pulled By: saketh-are
fbshipit-source-id: f8fa59173ea4772c8ad1332548ef4d9ea8f01178