Add torchrec_dlrm model (#1397)
Summary:
Original model file: https://github.com/facebookresearch/dlrm/blob/main/torchrec_dlrm/dlrm_main.py
The benchmark uses the default config from that model file.
Running on a single GPU required a few changes: removing DDP, changing the device from "meta" to a concrete device, and removing the fused optimizer.
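The three single-GPU changes can be illustrated with a minimal sketch. It assumes a toy module (`TinyDenseArch`, hypothetical, not the torchrec or DLRM class) and uses a plain `torch.optim.SGD` stepped after `backward()` as a stand-in for the removed fused optimizer. The helpers `build_single_device_model` and `materialize_on` are also hypothetical: the first builds directly on a concrete device with no `DistributedDataParallel` wrapper, and the second shows the alternative of materializing a "meta"-device module with `nn.Module.to_empty` and re-initializing its parameters. It is not the code in this PR.

```python
import torch
import torch.nn as nn


class TinyDenseArch(nn.Module):
    """Hypothetical stand-in for a DLRM sub-module (not the torchrec class)."""

    def __init__(self, in_features: int, out_features: int, device=None) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, device=device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


def build_single_device_model(device: torch.device) -> nn.Module:
    # Allocate parameters on a concrete device instead of "meta", and do not
    # wrap the module in DistributedDataParallel for a single-GPU run.
    return TinyDenseArch(13, 8, device=device)


def materialize_on(model: nn.Module, device: torch.device) -> nn.Module:
    # Alternative: turn a "meta" module into real (uninitialized) storage on
    # `device`, then re-run each submodule's initializer.
    model = model.to_empty(device=device)
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
    return model


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = build_single_device_model(device)
    # Plain (non-fused) optimizer, stepped explicitly after backward().
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    out = model(torch.randn(4, 13, device=device))
    out.sum().backward()
    optimizer.step()
    optimizer.zero_grad()
    print(out.shape)
```

Building directly on the target device avoids a second allocation pass; `to_empty` is mainly useful when the construction code is shared with a sharded (multi-GPU) path that expects "meta" tensors.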
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1397
Test Plan:
Experiments on a 16 GB T4 GPU:
```
$ python run.py torchrec_dlrm -d cuda --bs 4096 -t eval
Running eval method from torchrec_dlrm on cuda in eager mode with input batch size 4096.
GPU Time: 5.040 milliseconds
CPU Total Wall Time: 5.066 milliseconds
$ python run.py torchrec_dlrm -d cuda --bs 4096 -t train
Running train method from torchrec_dlrm on cuda in eager mode with input batch size 4096.
GPU Time: 32.600 milliseconds
CPU Total Wall Time: 32.625 milliseconds
```
```
$ python run.py torchrec_dlrm -d cuda --bs 4096 -t eval --torchdynamo inductor
[2023-02-04 14:14:36,265] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to multiple devices
Segmentation fault (core dumped)
$ python run.py torchrec_dlrm -d cuda --bs 4096 -t train --torchdynamo inductor
[2023-02-04 14:14:36,265] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to multiple devices
Segmentation fault (core dumped)
```
Reviewed By: yf225, brad-mengchi
Differential Revision: D43018611
Pulled By: xuzhao9
fbshipit-source-id: 3438535cc0bad2f151fae40305f5ba5e5f990ef6