out variant for native_batch_norm forward (#29192)
Summary:
This change adds an out variant for the forward pass of the native BatchNorm CUDA implementation, to support in-place operation. For the larger issue, see https://github.com/pytorch/pytorch/issues/26288
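To illustrate what an "out variant" means here, the following is a minimal pure-Python sketch, not the CUDA kernel from this PR. The function `batch_norm_forward_out` and its parameters are hypothetical. It is loosely modeled on the three results `native_batch_norm` produces (output, saved mean, saved inverse std), and it omits running statistics and momentum. Instead of allocating fresh results, it writes into buffers the caller supplies. Because per-channel statistics are computed before any element is overwritten, `out` may alias the input, which is the property that makes an in-place call possible.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def batch_norm_forward_out(
    x: Matrix,
    weight: Optional[List[float]],
    bias: Optional[List[float]],
    eps: float,
    out: Matrix,
    save_mean: List[float],
    save_invstd: List[float],
) -> Tuple[Matrix, List[float], List[float]]:
    """Hypothetical training-mode batch norm over an (N, C) input that
    writes into caller-provided buffers (out-variant style sketch)."""
    n = len(x)
    c = len(x[0])
    # Pass 1: per-channel statistics, stored into the caller's buffers.
    for j in range(c):
        mean = sum(x[i][j] for i in range(n)) / n
        var = sum((x[i][j] - mean) ** 2 for i in range(n)) / n  # biased variance
        save_mean[j] = mean
        save_invstd[j] = 1.0 / math.sqrt(var + eps)
    # Pass 2: normalize. Each element is read before it is written,
    # so `out` may be the same object as `x` (in-place operation).
    for i in range(n):
        for j in range(c):
            w = weight[j] if weight is not None else 1.0
            b = bias[j] if bias is not None else 0.0
            out[i][j] = (x[i][j] - save_mean[j]) * save_invstd[j] * w + b
    return out, save_mean, save_invstd
```

For example, passing the input itself as `out` normalizes it in place: with `x = [[1.0, 2.0], [3.0, 6.0]]` and `eps = 0.0`, `x` becomes `[[-1.0, -1.0], [1.0, 1.0]]`, with saved means `[2.0, 4.0]` and inverse standard deviations `[1.0, 0.5]`.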
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29192
Differential Revision: D19410370
Pulled By: ezyang
fbshipit-source-id: a6889c96bdd848f3a1cb2d943d06e054d22fb7ab