[pytorch][embeddingbag_8bit] Add include_last_offset option to Fused 8bit EmbeddingBag and parallelize the op (#32683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32683
Pull Request resolved: https://github.com/pytorch/glow/pull/4079
Similar to D17768404, we changed the EmbeddingBag operator for 8-bit fused version to add the option to include the last offset and parallelize the op.
ghstack-source-id: 97404645
Test Plan:
To generate the AVX2 code (`embedding_lookup_fused_8bit_rowwise_idx_avx2.cc`):
```
python hp_emblookup_codegen.py --fused --use-offsets
```
To test the correctness:
```
buck test //caffe2/torch/fb/sparsenn:test -- test_embedding_bag_byte_rowwise_offsets --print-passing-details
```
Reviewed By: yinghai
Differential Revision: D19592761
fbshipit-source-id: f009d675ea3f2228f62e9f86b7ccb94700a0dfe0