(#24396)
Summary:
Adds initial kernel support optimized for NHWC (channels-last) tensors.
TODO: the backward kernel currently produces a gradient tensor with NHWC strides, but autograd restores the grad to contiguous (in either the copy or the add path). This makes real perf tuning awkward, since I cannot easily measure end-to-end time from a Python script.
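For context, a minimal sketch of how one might observe whether the grad keeps NHWC strides, using max_pool2d purely as a stand-in op (the op covered by this PR and the availability of the channels_last memory-format API are assumptions here):

```python
import torch
import torch.nn.functional as F

# Build a channels-last (NHWC-strided) fp16 input on the GPU.
x = torch.randn(8, 64, 32, 32, device="cuda", dtype=torch.float16)
x = x.to(memory_format=torch.channels_last).requires_grad_()

y = F.max_pool2d(x, kernel_size=2)  # stand-in op; not necessarily this PR's op
y.sum().backward()

# If autograd has restored the grad to contiguous (NCHW), this prints False.
print(x.grad.is_contiguous(memory_format=torch.channels_last))
```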
My current kernel is significantly faster than the original NCHW kernel in fp16, since it avoids atomicAdd. I'll finish perf tuning after a future PR expanding NHWC support in core has been merged.
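Until end-to-end timing works, a kernel-level comparison like the sketch below is one way to benchmark the forward path in isolation; again, max_pool2d and the shapes are illustrative assumptions, not this PR's exact setup:

```python
import torch
import torch.nn.functional as F

def bench(x, iters=100):
    # Time the forward kernel with CUDA events, in ms per call.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        F.max_pool2d(x, kernel_size=2)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(32, 128, 56, 56, device="cuda", dtype=torch.float16)
print("NCHW:", bench(x))
print("NHWC:", bench(x.to(memory_format=torch.channels_last)))
```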
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24396
Differential Revision: D18115941
Pulled By: VitalyFedyunin
fbshipit-source-id: 57b4922b7bf308430ffe1406681f68629baf8834