[caffe2] Optimize Dedup version of RowWiseSparseAdagrad fused op by WarpReduce (#45649)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44275
* This Diff applies a WarpReduce optimization to the dedup version of the RowWiseSparseAdagrad fused op, yielding roughly a 1.33x performance improvement.
* Port the approach from D23948802 for computing num_dup
* Fix a likely fp16 bug in the dedup kernel
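
The WarpReduce pattern referenced above can be sketched as follows. This is a minimal, illustrative example of warp-level reduction via shuffle intrinsics, not the actual caffe2 kernel; the function name and usage comment are assumptions for illustration.

```cuda
// Minimal sketch of a warp-level sum reduction using CUDA shuffle
// intrinsics. Illustrative only; not the actual fused-op kernel.
__inline__ __device__ float warp_reduce_sum(float val) {
  // Each iteration folds the upper half of active lanes into the lower
  // half, so after log2(warpSize) steps lane 0 holds the full sum.
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffff, val, offset);
  }
  return val;  // valid in lane 0
}
```

In a row-wise Adagrad update, a reduction like this lets one warp accumulate the sum of squared gradients for a row in registers, avoiding shared-memory traffic before the per-row moment update.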
Reviewed By: jianyuh
Differential Revision: D23561994
fbshipit-source-id: 1a633fcdc924593063a67f9ce0d36eadb19a7efb