Enable TensorIterator fast path for contiguous inputs with no broadcasting (#29180)
Summary:
As title. Also, in the regular path, allocates outputs with `empty` instead of `empty_strided` when possible. This avoids resizing the outputs and taking an additional DeviceGuard for the resize.
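For illustration only, the eligibility condition named in the title (all operands contiguous, no broadcasting) can be sketched in plain Python. This is an assumption-laden sketch, not the TensorIterator code: the helpers `contiguous_strides`, `is_contiguous`, and `can_use_fast_path` are hypothetical names, operands are modelled as `(shape, strides)` pairs with strides in elements, and the real check may apply further conditions.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) element strides for `shape`."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if `strides` describe a densely packed row-major layout."""
    if 0 in shape:
        # A tensor with no elements is treated as contiguous.
        return True
    expected = contiguous_strides(shape)
    # The stride of a size-1 dimension never affects addressing, so ignore it.
    return all(size == 1 or got == want
               for size, got, want in zip(shape, strides, expected))


def can_use_fast_path(operands):
    """Hypothetical check: `operands` is a list of (shape, strides) pairs."""
    if not operands:
        return False
    first_shape = tuple(operands[0][0])
    for shape, strides in operands:
        if tuple(shape) != first_shape:
            return False  # shapes differ, so broadcasting would be required
        if not is_contiguous(tuple(shape), tuple(strides)):
            return False  # strided or transposed operand
    return True
```

Under these assumptions, two `(2, 3)` operands with strides `(3, 1)` qualify, while a transposed operand with strides `(1, 2)` or an operand of shape `(1, 3)` falls back to the regular path.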
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29180
Test Plan: covered by existing tests
Differential Revision: D18327836
Pulled By: ngimel
fbshipit-source-id: e8d925f0fe915f327ec41aba83fd6857b09772b5