hf and multimodal clip (#1921)
Summary:
The multimodal CLIP model is in canary because of its torchtext dependency, so this PR adds the HF (Hugging Face) version.
Notably, this PR also makes it possible to pass dict-based inputs to the module returned by `get_module()`, a pattern that is very common in HF code.
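As a rough illustration of what dict-based inputs mean for callers, here is a minimal sketch. It is not the code from this PR: `call_with_example_inputs` and `toy_clip` are hypothetical names, and the assumption is that `get_module()` yields a callable plus example inputs that may be either a dict or a tuple.

```python
from collections.abc import Mapping
from typing import Any, Callable


def call_with_example_inputs(module: Callable[..., Any], example_inputs: Any) -> Any:
    """Call `module` with inputs given as a dict, a tuple/list, or a single value.

    Hypothetical helper for illustration only; not part of the benchmark API.
    """
    if isinstance(example_inputs, Mapping):
        # Dict-style inputs (common in HF code) are unpacked as keyword arguments.
        return module(**example_inputs)
    if isinstance(example_inputs, (tuple, list)):
        # Positional inputs are unpacked in order.
        return module(*example_inputs)
    # Anything else is treated as a single positional argument.
    return module(example_inputs)


def toy_clip(input_ids, pixel_values):
    """Stand-in for a real model; just counts the elements it receives."""
    return len(input_ids) + len(pixel_values)


# The same callable works with dict-based and tuple-based example inputs.
dict_result = call_with_example_inputs(
    toy_clip, {"input_ids": [1, 2, 3], "pixel_values": [0.5, 0.25]}
)
tuple_result = call_with_example_inputs(toy_clip, ([1, 2, 3], [0.5, 0.25]))
```

Dispatching on `Mapping` rather than `dict` also covers dict-like containers, which is an assumption about the kinds of inputs HF tokenizers and processors tend to return.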
Pull Request resolved: https://github.com/pytorch/benchmark/pull/1921
Reviewed By: kartikayk, xuzhao9
Differential Revision: D49584110
Pulled By: msaroufim
fbshipit-source-id: 34bc581515c860b05018413004b9b8067709fedc