[hierarchical sharding 5/n] enable table-wise -> col-wise sharding in embedding table lookup
Summary:
This diff adds table-wise -> col-wise sharding support in GroupedShardedEmbeddingBag. Changes include:
1. Set up the necessary member variables.
2. Create a new fast kernel and add fast kernel lookup support.
3. Add intra-host all2all and cross-host all2all logic.
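For readers unfamiliar with the scheme, the two-level placement can be sketched roughly as below. This is an illustrative sketch only, not the GroupedShardedEmbeddingBag code: `hierarchical_shard`, `lookup`, the greedy host assignment, and the even column split are all assumptions made for the example. It places each table on one host (table-wise), splits that table's columns across the host's GPUs (col-wise), and then checks that concatenating the per-GPU partial lookups reproduces the full embedding rows.

```python
from typing import Dict, List, Tuple

Plan = Dict[str, Tuple[int, List[Tuple[int, int]]]]


def hierarchical_shard(table_dims: Dict[str, int], num_hosts: int, gpus_per_host: int) -> Plan:
    """Hypothetical helper: return {table: (host, [(col_start, col_end) per local GPU])}."""
    host_load = [0] * num_hosts
    plan: Plan = {}
    # Level 1 (table-wise): place the widest tables first on the least-loaded host.
    # The greedy policy is an assumption of this sketch.
    for name, dim in sorted(table_dims.items(), key=lambda kv: -kv[1]):
        host = host_load.index(min(host_load))
        host_load[host] += dim
        # Level 2 (col-wise): split the embedding dimension as evenly as possible
        # across the GPUs of the chosen host.
        base, extra = divmod(dim, gpus_per_host)
        ranges, start = [], 0
        for gpu in range(gpus_per_host):
            width = base + (1 if gpu < extra else 0)
            ranges.append((start, start + width))
            start += width
        plan[name] = (host, ranges)
    return plan


def lookup(table: List[List[int]], ids: List[int], col_range: Tuple[int, int]) -> List[List[int]]:
    """Partial lookup performed by one GPU: only its column slice of each requested row."""
    lo, hi = col_range
    return [table[i][lo:hi] for i in ids]


plan = hierarchical_shard({"t0": 8, "t1": 4, "t2": 4}, num_hosts=2, gpus_per_host=4)

# Toy table with 3 rows and 8 columns; entry value encodes (row, column).
table = [[r * 10 + c for c in range(8)] for r in range(3)]
ids = [2, 0]
host, ranges = plan["t0"]
partials = [lookup(table, ids, rng) for rng in ranges]
# Concatenating the column slices (the role the intra-host exchange plays here)
# recovers the full rows.
merged = [sum((p[k] for p in partials), []) for k in range(len(ids))]
assert merged == [table[i] for i in ids]
```

In this toy layout the column slices of a table all live on one host, so reassembling a row needs only intra-host traffic, while delivering results to trainers on other hosts needs the cross-host step.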
Test Plan:
Unit tests:
```
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_spawn
```
```
buck test caffe2/torch/fb/hpc/tests:model_sharder_test
```
QPS check:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 10000 --num-dpp-worker-threads 16 --num-readers 100 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "[table_based, column_based]" --flow-entitlement ads_global_qps
```
with diff:
dec inline_cvr:
table-wise -> table-wise (82k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_d0a0cba5?version=0&tab=status&env=PRODUCTION
table-wise -> column-wise (80k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_b1ac5873
column-wise:
dec inline_cvr:
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623827677%2F127.0.0.1%2Flibkineto_activities_4550.json.gz&bucket=gpu_traces
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_a79e1522 (81k)
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_2dacc13e (88k)
row-wise (62k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_4e349cab
table-wise (90k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_5d51b608
10x ctr_mbl_feed:
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 128 --use-shrunk-model false --model-version=ctr_mbl_oct_2020_10x_3tb --num-dpp-worker-threads 16 --num-readers 200 --fast-kernel table_batched --max-batches 5000000 --hpc-identity ads_model_platform --table-partition column_based --flow-entitlement ads_global_tc_mimo
```
column-wise:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_f05fb306?version=0&tab=status&env=PRODUCTION (290k)
without diff:
dec inline_cvr:
column-wise (87k):
gpu trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1623864444%2F127.0.0.1%2Flibkineto_activities_4451.json.gz&bucket=gpu_traces
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_e1315f14
row-wise (60k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_8fcc0adf
table-wise (91k):
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_cb94ff41
10x ctr_mbl_feed:
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_203ef35b?version=0&tab=status&env=PRODUCTION (281k)
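To make the comparison easier to read, the relative QPS change can be computed from the figures reported above. This is a small sketch: the numbers are copied from this test plan (in thousands of QPS), while the helper name `relative_change_pct` is made up for the example. The dec inline_cvr column-wise case is left out because it has two runs with the diff (81k and 88k) against a single run without it (87k).

```python
# QPS in thousands, as (with diff, without diff), copied from the runs listed above.
qps_k = {
    "dec inline_cvr, row-wise": (62, 60),
    "dec inline_cvr, table-wise": (90, 91),
    "10x ctr_mbl_feed": (290, 281),
}


def relative_change_pct(with_diff: float, without_diff: float) -> float:
    """Percentage change of the with-diff QPS relative to the without-diff QPS."""
    return 100.0 * (with_diff - without_diff) / without_diff


deltas = {name: round(relative_change_pct(w, wo), 1) for name, (w, wo) in qps_k.items()}
for name, pct in deltas.items():
    print(f"{name}: {pct:+.1f}%")
```

These are single runs, so differences of a few percent may be within run-to-run noise.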
NE check (using deterministic reading, D28711400):
```
buck run mode/dev-nosan -c python.package_style=inplace caffe2/torch/fb/training_toolkit/examples:sync_sgd_local_driver -- prod-preset --num-trainers 32 --use-shrunk-model false --model-version=inline_cvr_dec_2020 --fast-kernel table_batched --max-batches 100000 --num-dpp-worker-threads 16 --num-readers 64 --hpc-identity ads_model_platform --table-partition hierarchical_based --hierarchical-options "[table_based, column_based]" --flow-entitlement ads_global_qps --use-deterministic-model --use-deterministic-reading --model-entity-id 995557193
```
without this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: qps-qps|total_examples 1867776.0
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: qps-qps|window_qps 491.5199890136719
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION
with this diff:
```
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: ne-ne|lifetime_ne 0.8660048340401448
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: ne-ne|window_ne 0.8660048340401447
I0611 12:19:18.766000 647 print_publisher.py:33 master ] Publishing batch metrics: qps-qps|total_examples 1867776.0
```
https://www.internalfb.com/mast/job/tsm_ruilinchen-SparseNNApplication_15bc6243?version=0&tab=status&env=PRODUCTION
Reviewed By: JadeNie
Differential Revision: D28689126
fbshipit-source-id: 1c7879d4e3ee2b90aaf2a89e87f7b827d54173b3