[PyTorch] [Quantization] Speed up PackedEmbeddingBagWeight::prepack() (#66632)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66632
Calling `.item<float>()` for each element in a tensor is expensive, since every call goes through the full dispatch machinery. Instead, copy the entire tensor in a single call via `Tensor::copy_(input_tensor)`. See [this post](https://fb.workplace.com/groups/1144215345733672/posts/2080756188746245/) for more details.
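A minimal sketch of the pattern (not the actual `prepack()` code; `copy_slow`/`copy_fast` are hypothetical helpers that assume a 1-D float tensor, purely to contrast the two approaches):

```cpp
#include <ATen/ATen.h>
#include <vector>

// Before: one dispatched .item<float>() call per element.
std::vector<float> copy_slow(const at::Tensor& t) {
  std::vector<float> out(t.numel());
  for (int64_t i = 0; i < t.numel(); ++i) {
    // Each item<float>() round-trips through ATen dispatch, so the loop
    // pays O(numel) dispatch overhead on top of the actual reads.
    out[i] = t[i].item<float>();
  }
  return out;
}

// After: a single bulk copy via Tensor::copy_().
std::vector<float> copy_fast(const at::Tensor& t) {
  std::vector<float> out(t.numel());
  // Wrap the destination buffer as a Tensor and let copy_() move all
  // elements in one dispatched call.
  at::Tensor dst = at::from_blob(out.data(), {t.numel()}, at::kFloat);
  dst.copy_(t.reshape({t.numel()}));
  return out;
}
```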
ghstack-source-id: 140639868
Test Plan:
Build and run with bundled inputs.
### AI Bench
Before: [AI Bench](https://www.internalfb.com/intern/aibench/details/877359346171823), [Flamegraph](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/speech_transducer_v6_perf_1634185889953.html): 500ms
After: [AI Bench](https://www.internalfb.com/intern/aibench/details/60828780633319), [Flamegraph](https://interncache-all.fbcdn.net/manifold/aibench/tree/mobile/pt/profiling_reports/speech_transducer_v6_perf_1634231176980.html): 444ms
This brings the latency from 500ms down to 444ms, a ~11% reduction.
Reviewed By: supriyar
Differential Revision: D31657430
fbshipit-source-id: 199ec9de3dab84bb5727d81c7804bb83bebf7b48