Enabling concat fast path for channels last inputs (#39448)
Summary:
Updates the concat kernel's fast path for contiguous inputs so that it also supports channels_last contiguous tensors.
This was tested with the SqueezeNet model on a Pixel 2 device, where it improved model performance by about 25%.
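For illustration only, here is a minimal pure-Python sketch of why a channels_last tensor can use a contiguous-style copy when concatenating along the channel dimension. It is not the kernel code from this PR. The helper names `channels_last_strides` and `cat_channels_last` are hypothetical, and the sketch assumes all inputs share the same N, H, and W and are stored densely in NHWC order.

```python
def channels_last_strides(n, c, h, w):
    """Element strides of an (n, c, h, w) tensor stored densely in NHWC order."""
    return (h * w * c, 1, w * c, c)


def cat_channels_last(buffers, channels, n, h, w):
    """Concatenate along the channel dimension (hypothetical helper).

    buffers[i] is a flat list holding an (n, channels[i], h, w) tensor in
    channels-last (NHWC) memory order. Returns the flat NHWC buffer of the
    concatenated result.
    """
    out = []
    for pixel in range(n * h * w):  # one (n, h, w) position at a time
        for buf, c in zip(buffers, channels):
            # Each input contributes c consecutive elements for this pixel,
            # so the copy is a sequence of contiguous chunks.
            out.extend(buf[pixel * c:(pixel + 1) * c])
    return out


# Two inputs with n=1, h=1, w=2: a has 2 channels, b has 1 channel.
a = [1, 2, 3, 4]   # pixel 0 -> channels (1, 2), pixel 1 -> channels (3, 4)
b = [10, 20]       # pixel 0 -> channel (10,),  pixel 1 -> channel (20,)
result = cat_channels_last([a, b], [2, 1], n=1, h=1, w=2)
print(result)
print(channels_last_strides(1, 3, 1, 2))
```

Because the channel dimension has stride 1 in this layout, each input supplies one contiguous run per pixel, which is the property a chunked-copy fast path can exploit.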
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39448
Test Plan: test_cat_in_channels_last
Differential Revision: D22160526
Pulled By: kimishpatel
fbshipit-source-id: 6eee6e74b8a5c66167828283d16a52022a16997f