Fix padding_idx in the new embedding CUDA kernel. (#27731)
Summary:
The current embedding backward CUDA kernel is broken: it effectively ignores padding_idx, and it also incorrectly drops an index from the input.
This commit fixes both problems and corrects the unit test so that it catches this behavior if it regresses in the future.
This fixes https://github.com/pytorch/pytorch/issues/26302.
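For reference, the intended semantics can be sketched in plain Python. This is only an illustration of what the backward pass is expected to compute, not the CUDA kernel from this commit; the function name `embedding_backward_reference` and its argument layout are hypothetical. It assumes every input index contributes its gradient row to the matching weight row, except indices equal to padding_idx, whose weight row must keep a zero gradient.

```python
def embedding_backward_reference(grad_output, indices, num_weights, padding_idx=None):
    """Hypothetical reference for the expected embedding backward result.

    grad_output: list of gradient rows, one per input index.
    indices:     list of embedding row indices, same length as grad_output.
    Returns the dense gradient for the weight matrix as a list of rows.
    """
    dim = len(grad_output[0]) if grad_output else 0
    grad_weight = [[0.0] * dim for _ in range(num_weights)]
    for row, idx in zip(grad_output, indices):
        # Rows looked up at padding_idx must not receive any gradient.
        if padding_idx is not None and idx == padding_idx:
            continue
        # Every other index is accumulated; none may be dropped.
        for j, g in enumerate(row):
            grad_weight[idx][j] += g
    return grad_weight


grad = embedding_backward_reference(
    grad_output=[[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]],
    indices=[0, 2, 0, 1],
    num_weights=3,
    padding_idx=1,
)
```

In this sketch, index 0 appears twice and accumulates both gradient rows, while the row for padding_idx stays at zero. A unit test along these lines would compare the kernel's output against such a reference.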
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27731
Differential Revision: D17893803
Pulled By: ngimel
fbshipit-source-id: 4ba02a17ec0e29a7016d65480d4ff0c276550616