Make the speedup benchmark work with DDP + CompiledAutograd
Summary:
X-link: https://github.com/pytorch/pytorch/pull/120454
With DDP + CompiledAutograd, the benchmark cannot reuse the same parallelized model for the test. This PR copies the model instead.
ghstack-source-id: 217034133
exported-using-ghexport
Reviewed By: xmfan
Differential Revision: D54094257
fbshipit-source-id: 29fb31f9653a50d1a1e5f5ff398a668b7e4209e5