3ada2e0d - [pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#4049)

Summary:
Pull Request resolved: https://github.com/pytorch/glow/pull/4049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477

We would like to add intra-op parallelization support for the EmbeddingBag operator. This should bring a speedup for the DLRM benchmark: https://github.com/pytorch/pytorch/pull/24385

Benchmark code:
```
from __future__ import absolute_import, division, print_function, unicode_literals

import torch
import time

# 1,000,000 x 64 embedding table with sum pooling
eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum')

# 1500 random indices split into 64 bags (all offsets zero here)
input = torch.LongTensor(1500).random_(0, 1000000)
offsets = torch.zeros(64, dtype=torch.int64)

niter = 10000
s = time.time()
for _ in range(niter):
    out = eb(input, offsets)
time_per_iter = (time.time() - s) / niter
print('time_per_iter', time_per_iter)
print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9)
```

The following results are single-core on Skylake T6:

- Before our change (with the original caffe2::EmbeddingLookup):
  time_per_iter 6.313693523406982e-05
  GB/s 6.341517821789133
- After our change, using the EmbeddingLookupIdx API, which takes offsets instead of lengths:
  time_per_iter 5.7627105712890626e-05
  GB/s 6.947841559053659
- With Intel's PR (https://github.com/pytorch/pytorch/pull/24385):
  time_per_iter 7.393271923065185e-05
  GB/s 5.415518381664018

For multi-core performance: because Clang doesn't work with OpenMP, I can only measure single-core performance on SKL T6.

ghstack-source-id: 97124557

Test Plan:

With D16990830:
```
buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench
```

With D17750961:
```
buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb
buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb
```

OSS test:
```
python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu
```

Buck test:
```
buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu"

OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets" --print-passing-details
```

Generate the AVX2 code for embedding_lookup_idx_avx2.cc:
```
python hp_emblookup_codegen.py --use-offsets
```

Differential Revision: D17768404

fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700
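
Note (editorial, not part of the original commit): for readers less familiar with the offsets-based layout that EmbeddingLookupIdx consumes, here is a minimal sketch of how a lengths-based bag description maps to offsets. The bag sizes and table dimensions below are made up purely for illustration.
```
import torch

# Hypothetical example: three bags of 2, 3, and 1 indices over a tiny table.
lengths = torch.tensor([2, 3, 1])
# Offsets are the exclusive prefix sum of lengths: entry i marks where bag i starts.
offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)[:-1]])
print(offsets)  # tensor([0, 2, 5])

eb = torch.nn.EmbeddingBag(10, 4, mode='sum')
indices = torch.randint(0, 10, (int(lengths.sum()),))
out = eb(indices, offsets)  # one pooled row per bag -> shape (3, 4)
print(out.shape)
```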
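
As a sanity check on the benchmark's throughput metric (my reading, not stated in the original message): each call reads 1500 rows of 64 float32 values (1500 * 64 * 4 = 384,000 bytes) and writes a 64 x 64 float32 output (16,384 bytes), roughly 400 KB in total; dividing by the baseline time_per_iter of ~6.31e-05 s reproduces the ~6.34 GB/s figure quoted above.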