Reland: Fix CUDA device guard usage when first arg of kernel is scalar (#39956)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/39870
Closes https://github.com/pytorch/pytorch/issues/38889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39956
Differential Revision: D22027956
Pulled By: ngimel
fbshipit-source-id: e6029f450e2da3782b2d05bcc2012c19b82291da