Make .contiguous(memory_format) call .clone(memory_format) (#61456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61456
functorch is unable to `vmap(grad(f))` when `f` contains a `.contiguous`
call. This is because `.contiguous` (when it is not a no-op) decomposes
to `.copy_` under grad, and `.copy_` is not compatible with vmap.
The fix is to have `.contiguous` call `.clone` instead of `.copy_`.
`clone` is a primitive w.r.t. autograd, so under `grad`, `.contiguous`
now decomposes to `.clone` rather than `.copy_`.
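For illustration, here is a minimal sketch of the composition this change enables. The function `f` and the batch `xs` are made up for this example, and the import assumes functorch's transforms under their current `torch.func` home (at the time of this PR they lived in the standalone `functorch` package):

```python
import torch
from torch.func import grad, vmap  # originally: from functorch import grad, vmap

def f(x):
    # x.t() is non-contiguous, so .contiguous() performs a real copy here
    return (x.t().contiguous() ** 2).sum()

xs = torch.randn(8, 3, 4)  # a batch of 3x4 matrices
# Previously, grad decomposed .contiguous() into .copy_(), which vmap could
# not handle; with .clone() the two transforms compose.
per_sample_grads = vmap(grad(f))(xs)
print(per_sample_grads.shape)  # torch.Size([8, 3, 4])
```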
Perf testing (forward pass)
- [script and
output](https://gist.github.com/zou3519/294f583b9c5d7bdf234d5295f97fb02e)
- The instruction count increased from 774479 to 781379. This is because
we're now calling .clone(), which adds one extra dispatch. If we really
care about this, we could optimize the implementation of clone() in the
future so it doesn't dispatch to .copy_() (a rough measurement sketch
follows below).
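In case it's useful, here is a rough sketch of how such an instruction count can be collected with `torch.utils.benchmark`'s Callgrind support (illustrative only; the exact setup is in the linked gist, and running this requires valgrind):

```python
import torch
from torch.utils.benchmark import Timer

# stmt/setup are strings so the measurement can run in a clean subprocess.
timer = Timer(
    stmt="x.contiguous()",
    setup="import torch; x = torch.randn(3, 4).t()",  # non-contiguous, so a real copy happens
)
stats = timer.collect_callgrind(number=100)  # needs valgrind installed
print(stats.counts(denoise=True))            # total instruction count
```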
Perf testing (backward pass)
- [script and
output](https://gist.github.com/zou3519/6fbdb121de6342334192d55c8a72276a)
- The instruction count decreased from 5402648 to 5335977. This is
because the [backward for
.clone](https://github.com/pytorch/pytorch/blob/9b908ab0d0a947d89ac3137f8c4a05a87c35f568/tools/autograd/derivatives.yaml#L383)
is a lot simpler than the [backward for
copy_](https://github.com/pytorch/pytorch/blob/9b908ab0d0a947d89ac3137f8c4a05a87c35f568/torch/csrc/autograd/functions/tensor.cpp#L37-L41)
- The backward formulas for .clone() and .copy_() end up doing the same
thing for contiguous (from reading the code above, they both do no-op
copies); see the quick check below.
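As a quick check of that claim (a sketch, not taken from the PR): the gradient through a `.contiguous()` call that performs a real copy is just the incoming gradient passed through unchanged:

```python
import torch

x = torch.randn(3, 4).t().requires_grad_(True)  # non-contiguous leaf, so .contiguous() copies
y = x.contiguous()
y.backward(torch.ones_like(y))
# The backward of the underlying clone (previously copy_) forwards the gradient as-is.
print(torch.equal(x.grad, torch.ones_like(x)))  # True
```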
Test Plan:
- wait for existing tests (test_view_ops has the tests)
- functorch isn't tested in PyTorch CI yet.
- Taking suggestions on how to write a test for this. I'm thinking we
could use LoggingTensor from #59760 (because it logs underneath
autograd) and check that clone is called instead of copy_, but I didn't
want to refactor it into a utility yet; a rough sketch of that idea
follows below.
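A rough sketch of that test idea, using a dispatch-mode logger as a stand-in for the actual LoggingTensor from #59760 (the `TorchDispatchMode` API shown here landed after this PR; the asserted op names are what I'd expect after this change, not verified output):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpLogger(TorchDispatchMode):
    """Records every aten op dispatched underneath autograd."""
    def __init__(self):
        super().__init__()
        self.ops = []

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.ops.append(str(func))  # e.g. "aten.clone.default"
        return func(*args, **(kwargs or {}))

x = torch.randn(3, 4).t().requires_grad_(True)  # non-contiguous, so .contiguous() is not a no-op
with OpLogger() as log:
    y = x.contiguous()
    (gx,) = torch.autograd.grad(y.sum(), x)

# After this change, .contiguous() should reach the backend as a clone, not a copy_.
assert any("clone" in op for op in log.ops)
assert not any("copy_" in op for op in log.ops)
```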
Reviewed By: soulitzer
Differential Revision: D29636859
Pulled By: zou3519
fbshipit-source-id: 97eb56bfae1c4bb31612dc9d06536019f21d69a6