69b09eca - optimize reflection padding performance on CPU (#102254)

optimize reflection padding performance on CPU (#102254)

This patch improves reflection padding performance on CPU. The original kernel has nested parallel loops, e.g. first over the **batch** dimension and then over the **channels** dimension, which is not optimal when N * C is small. This patch collapses the NC dimensions and the adjacent spatial dimensions to maximize the parallelism scope.

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference

```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms;

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms;
```

### single socket inference

```
(before)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms;

(after)
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms;
ReflectionPad2d((2, 2, 2, 2)) size: torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102254
Approved by: https://github.com/cpuhrsch
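For reference, a minimal benchmark sketch along these lines can produce comparable measurements. The helper name `bench_reflection_pad2d`, the warmup/iteration counts, and the thread settings are illustrative assumptions, not the exact harness behind the numbers above.

```
# Minimal benchmark sketch (assumed harness, not the one used for the commit).
import time
import torch

def bench_reflection_pad2d(shape, padding=(2, 2, 2, 2), iters=100, warmup=10):
    pad = torch.nn.ReflectionPad2d(padding)
    x = torch.randn(shape)  # contiguous NCHW input
    for _ in range(warmup):
        pad(x)
    start = time.perf_counter()
    for _ in range(iters):
        pad(x)
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3
    print(f"ReflectionPad2d({padding}) size: {tuple(shape)} , NCHW: {elapsed_ms:.3f} ms;")

if __name__ == "__main__":
    # "single core" vs "single socket" runs differ only in the thread count;
    # 20 matches the cores per socket of the Xeon Gold 6248 mentioned above.
    torch.set_num_threads(1)  # use torch.set_num_threads(20) for single-socket runs
    with torch.no_grad():
        bench_reflection_pad2d((1, 3, 224, 224))
        bench_reflection_pad2d((128, 64, 56, 56))
```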