[caffe2] Fix a performance bug in Dedup SparseAdagrad op (#42287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42287
We shouldn't use block_size for the thread dimensions in linear_index_weight_offsets_dedup_kernel, since the kernel doesn't iterate over the embedding dimension.
ghstack-source-id: 108834058
Test Plan:
```
buck test mode/dev-nosan //caffe2/caffe2/fb/net_transforms/tests:fuse_sparse_ops_test -- 'test_fuse_sparse_adagrad_with_sparse_lengths_sum_gradient \(caffe2\.caffe2\.fb\.net_transforms\.tests\.fuse_sparse_ops_test\.TestFuseSparseOps\)' --print-passing-details
```
Reviewed By: jspark1105
Differential Revision: D22800959
fbshipit-source-id: 641d52a51070715c04f9fd286e7e22ac62001f61