[aten] Call fbgemm functions for embedding prepack/unpack (#44845)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44845
fbgemm functions are vectorized and faster
```
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786
Summary (total time 15.08s):
PASS: 7
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
Performance Before:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 68.727
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 131.500
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 248.190
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 172.742
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 333.008
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 652.423
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 167.282
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 398.901
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 785.254
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 122.653
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 230.617
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 408.807
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 176.087
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 337.514
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 659.716
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 342.529
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 665.197
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 1307.923
```
Performance After:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.782
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 17.443
# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 25.898
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 13.903
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 18.575
# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.650
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 14.158
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 19.818
# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.852
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 47.596
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 91.025
# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 131.425
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.637
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.856
# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 33.944
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 21.181
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 34.213
# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 59.622
```
ghstack-source-id: 112836216
Test Plan: buck test //caffe2/test:quantization -- 'test_embedding_bag*' --print-passing-details
Reviewed By: radkris-git
Differential Revision: D23675777
fbshipit-source-id: 0b1a787864663daecc7449295f9ab6264eac52fc