Remove call to `.contiguous()` for `local_shard_t`.
The call to contiguous was probably left over from a previous
implementation and is no longer needed.
Had to adjust `atol` for one of the tests to accommodate this change.
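
This message does not say why the tolerance needed to change. One common cause, assumed here rather than confirmed for this PR, is that a different memory layout can change the order in which floating-point values are accumulated, which shifts results in the last few bits. The sketch below uses plain Python rather than the PyTorch code touched by this change; the variable names and the `ABS_TOL` value are illustrative only.

```python
import math

# Hypothetical stand-ins for one reduction computed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Floating-point addition is not associative, so exact equality fails here.
exact_match = left_to_right == right_to_left

# An absolute-tolerance comparison accepts the tiny difference.
# ABS_TOL is an illustrative value, not the one used in the test suite.
ABS_TOL = 1e-9
close_enough = math.isclose(
    left_to_right, right_to_left, rel_tol=0.0, abs_tol=ABS_TOL
)

print(exact_match, close_enough)
```

A test that compares against a reference with a tight `atol` can therefore start failing after a layout-only change even though both results are equally valid.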
Differential Revision: [D36797942](https://our.internmc.facebook.com/intern/diff/D36797942/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78598
Approved by: https://github.com/kumpera