Handle grad with `stride=0` in GPU MvBackward (#38321)
Summary:
References: https://github.com/pytorch/pytorch/issues/38315, https://github.com/pytorch/pytorch/issues/29984
cuBLAS expects strides to be greater than 0.
Cloning `grad` allocates a new vector with non-zero strides.
On CPU, we skip the clone (and the extra allocation)
because the CPU implementation works with stride=0.
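The stride behaviour described above can be reproduced from Python. The following is a minimal sketch, not the code changed in this PR: `prepare_grad` is a hypothetical helper that only mirrors the rule stated in the summary (clone on GPU when the stride is 0, pass through on CPU), and the expanded tensor stands in for the kind of gradient that reaches MvBackward.

```python
import torch

# Stand-in for an incoming gradient: one value expanded to length 3.
# expand() does not copy data, so the resulting view has stride 0.
grad = torch.ones(()).expand(3)

# clone() allocates fresh storage, so the copy gets a regular stride.
cloned = grad.clone()

print(grad.stride())    # (0,)
print(cloned.stride())  # (1,)


def prepare_grad(grad: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper (an assumption, not the PR's code).

    Clone only when the tensor lives on a CUDA device and has stride 0;
    otherwise return the tensor unchanged to avoid an extra allocation.
    """
    if grad.is_cuda and grad.stride(0) == 0:
        return grad.clone()
    return grad


# On CPU the helper returns the very same tensor object, stride 0 included.
cpu_result = prepare_grad(grad)
```

Keeping the clone behind the device check avoids a copy on the CPU path, where the summary states that stride 0 already works.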
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38321
Differential Revision: D21628966
Pulled By: ngimel
fbshipit-source-id: 390caf835af6d1d77ed537b7fcc113a22c3ec301