optimize replication padding performance on CPU (#102255)
The major difference from the previous ReflectionPad PR is the padding indexing struct, `ReplicationPad::index()`; the rest is largely the same.
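For context, replication ("edge") padding maps each output index to an input index by clamping to the valid range, so border values are repeated. The snippet below is a minimal pure-Python sketch of that index mapping for the 1-D case; the function name and example values are illustrative, not the actual ATen implementation.

```python
def replication_pad_index(j, pad_left, input_size):
    # Map an output index j to the input index it replicates:
    # interior positions map directly (j - pad_left), while
    # out-of-range positions clamp to the nearest edge element.
    return min(max(j - pad_left, 0), input_size - 1)

# 1-D example: input of size 4 padded by 2 on each side.
row = [10, 20, 30, 40]
padded = [row[replication_pad_index(j, 2, len(row))] for j in range(len(row) + 4)]
# padded == [10, 10, 10, 20, 30, 40, 40, 40]
```

In 2-D (NCHW), the same clamp is applied independently to the height and width indices, which is what the per-dimension `index()` helper computes.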
The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.
### single core inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms;
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms;
(after)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms;
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms;
```
### single socket inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms;
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms;
(after)
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms;
ReplicationPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102255
Approved by: https://github.com/cpuhrsch