pytorch
94e83da5 - Optimization of the Embedding and Embedding-Bag CUDA Kernel (#22016)

Commit

5 years ago

Optimization of the Embedding and Embedding-Bag CUDA Kernel (#22016)

Summary: Re-implementation of the `embedding_dense_backward_cuda()` and the `embedding_bag_backward_cuda_sum_avg()` functions.

#### Performance

Running a [Mortgage Workflow](https://github.com/EvenOldridge/MortgageWorkflowA) with a block size of 100K on a DGX-2 (single GPU), we see roughly a 2.8x speedup:

```
Original version: 370,168 example/s
Optimized version: 1,034,228 example/s
```

The original version is bottlenecked by the `EmbeddingBag_accGradParametersKernel_sum_avg` kernel, which takes 70% of the CUDA execution time. In the optimized version, the optimized kernel takes only 17% of the time.

#### Greater Numerical Stability

An added benefit is greater numerical stability. Instead of a flat sum, in which a single variable is used to accumulate the weights, this code uses two steps: each GPU thread first computes a sub-result whose size is defined by `NROWS_PER_THREAD`, and the sub-results are then accumulated into the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/22016

Differential Revision: D15944339

Pulled By: mrshenli

fbshipit-source-id: 398d5f48826a017fc4b31c24c3f8b56d01830bf0
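To illustrate why a two-step accumulation can be more stable than a flat sum, here is a minimal CPU-side sketch in Python. It is not the kernel code from this commit. It assumes half-precision accumulation (emulated with the standard `struct` module's `"e"` format), the value `NROWS_PER_THREAD = 64` is chosen only for illustration, and `to_half`, `flat_sum`, and `two_step_sum` are hypothetical helper names.

```python
import struct

NROWS_PER_THREAD = 64  # illustrative value; the commit does not state the kernel's setting


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def flat_sum(values):
    """Accumulate every value into one low-precision variable."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc


def two_step_sum(values, rows_per_thread=NROWS_PER_THREAD):
    """Sum fixed-size chunks first, then accumulate the partial results."""
    partials = [
        flat_sum(values[i:i + rows_per_thread])
        for i in range(0, len(values), rows_per_thread)
    ]
    return flat_sum(partials)


grads = [1.0] * 4096  # the exact sum is 4096
flat_result = flat_sum(grads)          # stalls once adding 1.0 no longer changes the accumulator
two_step_result = two_step_sum(grads)  # each partial stays small enough to be exact
print(flat_result, two_step_result)    # prints: 2048.0 4096.0
```

In this toy case the flat accumulator stops growing at 2048, because half precision cannot represent 2049, while the chunked version keeps every intermediate value exactly representable. The real kernel works on GPU threads and on different data types, so the size of the effect there will differ.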

Author

madsbk

Committer

facebook-github-bot

Parents

b0bd8758
