Fast path for contiguous tensor (#38732)
Summary:
In a local run, this reduces the time to run 2000 guards from 0.00282s to 0.00187s (~30%). The fast path applies when the tensor is already known to be contiguous: in that case we no longer recompute contiguity from the strides of each dimension.
Other cases can be optimized further once there is a repro script for them.
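For illustration only, the idea can be sketched in Python as below. This is not the code changed in this PR: the function names and the `known_contiguous` parameter are hypothetical, and the stride walk is a simplified row-major check.

```python
from typing import Optional, Sequence


def compute_contiguous(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """Slow path: derive row-major contiguity by walking every dimension."""
    # Assumption: an empty tensor (any size 0) is treated as contiguous.
    if any(size == 0 for size in sizes):
        return True
    expected = 1
    # Walk from the innermost dimension outwards.
    for size, stride in zip(reversed(sizes), reversed(strides)):
        # The stride of a size-1 dimension does not affect the layout.
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


def is_contiguous(
    sizes: Sequence[int],
    strides: Sequence[int],
    known_contiguous: Optional[bool] = None,  # hypothetical cached flag
) -> bool:
    """Fast path: trust a cached 'contiguous' flag and skip the stride walk."""
    if known_contiguous:
        return True
    return compute_contiguous(sizes, strides)
```

With a cached flag the per-dimension loop is skipped entirely, e.g. `is_contiguous([2, 3], [3, 1], known_contiguous=True)`; without it, the call falls back to `compute_contiguous`.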
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38732
Differential Revision: D21664191
Pulled By: ailzhang
fbshipit-source-id: 125950f20c8676afc447f1d27ce4d14bbd445918