optimize reflection padding performance on CPU (#102254)
This patch improves reflection padding performance on CPU.
The original kernel has nested parallelized loops, e.g. first over the **batch** dimension and then over the **channels** dimension, which is suboptimal when N * C is small. This patch collapses the N and C dimensions (and adjacent spatial dimensions) into one, to maximize the parallelism scope.
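The idea behind the dimension collapse can be sketched as follows. This is a minimal NumPy illustration, not the actual C++ kernel: instead of a nested loop over N and then C (where either loop alone may offer too little work to spread across threads), the N and C dimensions are reshaped into a single flat dimension of size N * C, so one loop covers every 2-D plane and exposes the full parallelism scope.

```python
import numpy as np

def reflect_pad2d_collapsed(x, pad):
    """Reflection-pad a (N, C, H, W) array; pad = (left, right, top, bottom)."""
    n, c, h, w = x.shape
    left, right, top, bottom = pad
    # Collapse N and C into a single leading dimension so that one flat
    # loop of size N*C covers all planes (mirroring the enlarged parallel
    # scope; the real kernel would hand this range to at::parallel_for).
    flat = x.reshape(n * c, h, w)
    out = np.empty((n * c, h + top + bottom, w + left + right), dtype=x.dtype)
    for nc in range(n * c):  # single loop over the collapsed dimension
        out[nc] = np.pad(flat[nc], ((top, bottom), (left, right)),
                         mode="reflect")
    return out.reshape(n, c, h + top + bottom, w + left + right)
```

With a nested `for n: for c:` structure, parallelizing only the outer loop leaves at most N units of work; the collapsed form always yields N * C, which matters precisely in the small-N*C cases the patch targets.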
The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.
### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms;
(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms;
```
### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms;
(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102254
Approved by: https://github.com/cpuhrsch