[PyTorch] Remove reference_cast in make_boxed_from_unboxed_functor (#51319)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51319
We were going out of our way to accommodate `IValue::to<Tensor>` returning a copy of the inner Tensor. `IValue::toTensor` is capable of returning a reference without copying, so if we use it directly, we can allow kernels that want to take `Tensor&` to do so!
As a bonus, we get reduced build times.
ghstack-source-id: 121378961
Test Plan:
Rely on CI for correctness.
Profiled build time with -ftime-trace for RegisterCPU.cpp using an extracted build invocation.
Before: P168244900
After: P168245014
Note reduced time spent compiling make_boxed_from_unboxed_functor.
I also ran the AdIndexer benchmark (https://fb.quip.com/ztERAYjuzdlr) with static runtime disabled and batch size 1 to measure the effect on boxed call performance: any kernel that takes `Tensor&` or `const Tensor&` should now save a refcount bump. The result was roughly 1% better:
Before: 124-125 usec/iter
After: 122-123 usec/iter
Reviewed By: bhosmer
Differential Revision: D26138549
fbshipit-source-id: b0f830527da360c542c815bef2f7e1692615b32a