[pytorch] Support embedding_bag_4bit_rowwise_offsets in CUDA (#61728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61728
Templatize the existing embedding_bag_byte_rowwise_offsets_kernel so that it supports both 4 bits per dimension and 8 bits per dimension. Tested with FB-internal randomized tests that compare the CUDA results against the CPU ops.
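For illustration only, here is a minimal host-side C++ sketch of what templating on the bit rate can look like. It is not the kernel from this PR. The helper name dequantize_row, the low-nibble-first packing order, and passing scale and bias as separate float arguments (rather than reading them from the tail of the row) are all assumptions made to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of PyTorch): decode one row-wise quantized
// row in which every element occupies BIT_RATE bits.
// Assumptions: elements are packed low bits first within each byte, and the
// row's scale and bias are supplied by the caller as floats.
template <int BIT_RATE>
std::vector<float> dequantize_row(const std::uint8_t* packed,
                                  std::size_t num_elems,
                                  float scale,
                                  float bias) {
  static_assert(BIT_RATE == 4 || BIT_RATE == 8,
                "this sketch only covers 4 and 8 bits per element");
  assert(packed != nullptr || num_elems == 0);

  constexpr std::size_t kElemsPerByte = 8 / BIT_RATE;  // 2 for 4-bit, 1 for 8-bit
  constexpr unsigned kMask = (1u << BIT_RATE) - 1u;    // 0x0F or 0xFF

  std::vector<float> out(num_elems);
  for (std::size_t i = 0; i < num_elems; ++i) {
    const unsigned byte = packed[i / kElemsPerByte];
    // Shift is always 0 for 8-bit rows; 0 or 4 for 4-bit rows.
    const unsigned shift =
        static_cast<unsigned>(i % kElemsPerByte) * BIT_RATE;
    const unsigned q = (byte >> shift) & kMask;
    out[i] = static_cast<float>(q) * scale + bias;
  }
  return out;
}
```

With this shape, dequantize_row<8> and dequantize_row<4> share one body, and the compiler resolves the per-byte element count and mask at compile time, which is the general benefit of templating a kernel on the bit rate instead of duplicating it.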
Reviewed By: hyuen
Differential Revision: D29706346
fbshipit-source-id: c9f4591a2cc6205e4b7e57a363ba0a6306fdddd5