enable channels last for replication padding on CPU (#102597)
This patch adds channels last support for ReplicationPad2d/3d on the CPU backend. The following test cases pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad3d_cpu_float32
```
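For readers unfamiliar with why channels last helps here, the following is a minimal pure-Python sketch of replication padding over an NHWC buffer. It is not the kernel from this patch. The function name `replication_pad2d_nhwc` and the flat-list representation are hypothetical, chosen for illustration. It assumes the padding order `(left, right, top, bottom)` used by `ReplicationPad2d`, and shows that each output pixel is a copy of one contiguous length-`C` slice taken at a clamped input coordinate.

```python
def replication_pad2d_nhwc(x, n, h, w, c, pad):
    """Replication-pad a flat NHWC buffer (illustrative sketch only).

    x   : flat list of length n*h*w*c stored in NHWC order
    pad : (left, right, top, bottom), the order ReplicationPad2d uses
    Returns the padded flat buffer and its (n, out_h, out_w, c) shape.
    """
    left, right, top, bottom = pad
    out_h = h + top + bottom
    out_w = w + left + right
    out = [0.0] * (n * out_h * out_w * c)
    for b in range(n):
        for oh in range(out_h):
            # Clamp the row index so border rows are replicated.
            ih = min(max(oh - top, 0), h - 1)
            for ow in range(out_w):
                # Clamp the column index so border columns are replicated.
                iw = min(max(ow - left, 0), w - 1)
                src = ((b * h + ih) * w + iw) * c
                dst = ((b * out_h + oh) * out_w + ow) * c
                # In NHWC the C channels of one pixel are contiguous,
                # so the whole pixel is moved as a single slice copy.
                out[dst:dst + c] = x[src:src + c]
    return out, (n, out_h, out_w, c)


# One image, 2x2 spatial, 1 channel, padded by 1 on every side.
padded, shape = replication_pad2d_nhwc([1, 2, 3, 4], 1, 2, 2, 1, (1, 1, 1, 1))
```

In a sketch like this, the innermost operation is a contiguous copy of `C` elements, which is the kind of access pattern that can be vectorized when the input is already in channels last layout.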
The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.
### single core inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.339 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 82.935 ms
(after)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.324 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 16.717 ms
```
### single socket inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.135 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 7.203 ms
(after)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NHWC: 0.029 ms
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NHWC: 3.174 ms
```
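To make the before/after numbers above easier to compare, here is a small sketch that computes the speedup ratio (before time divided by after time) for each reported case. The timings are copied from the benchmark output above. The dictionary layout and the name `timings_ms` are only illustrative.

```python
# (before_ms, after_ms) for NHWC input, taken from the results above.
timings_ms = {
    ("single core", "[1, 3, 224, 224]"): (0.339, 0.324),
    ("single core", "[128, 64, 56, 56]"): (82.935, 16.717),
    ("single socket", "[1, 3, 224, 224]"): (0.135, 0.029),
    ("single socket", "[128, 64, 56, 56]"): (7.203, 3.174),
}

# Speedup is the ratio of the old runtime to the new runtime.
speedups = {
    case: before / after for case, (before, after) in timings_ms.items()
}

for (mode, size), ratio in speedups.items():
    print(f"{mode:14s} {size:20s} {ratio:.2f}x")
```

The largest gain is on the large batch in the single core case, while the small single-image input on a single core is nearly unchanged.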
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102597
Approved by: https://github.com/CaoE, https://github.com/cpuhrsch