pytorch
c1795977 - Add native impl for group norm on quantized CPU for channels-last inputs (#70520)

Commit

2 years ago

Add native impl for group norm on quantized CPU for channels-last inputs (#70520) **Description** This PR adds a native implementation for group norm on quantized CPU for channels-last inputs. It will also be used by instance norm since the latter would call group norm kernel eventually. For channels last, group norm has an input shape of `{N, H, W, GD}`, mean and rstd are collected per each n and g, which involves reduction on non-adjacent dimensions. We can parallel in the following 2 impls: - impl-1: parallel on `N * G`. Only need one omp session but memory access per thread is non-contiguous. - impl-2: parallel on `N * HxW`. Memory access per thread is contiguous but requires help of extra temp buffer of size `{T, N, 2C}`. Generally, impl-2 has better performance when `HxW` is large enough, so that data per thread `{NHWC / T}` is much larger than temp buffer per thread `{2NC}` A threshold is defined to switch between the two implementations, which is found by tests. The unit test for quantized group norm is modified to cover more cases. **Performance test results** Test Env: - Intel® Xeon® CLX-8260 - 1 instance, 4 cores - Using Jemalloc Test method: Create channels-last tensors as inputs, do group norm by - Converting to contiguous then using NCHW kernel - Using NHWC impl-1 - Using NHWC impl-2 - Using fp32 kernel (no quantization) C=64 ![image](https://user-images.githubusercontent.com/12522207/147727406-baeb5c03-12a3-4925-8816-2ce84f2742a8.png) C=256 ![image](https://user-images.githubusercontent.com/12522207/147727617-6f631448-a1d3-483f-9f7a-60129bd36804.png) C=1024 ![image](https://user-images.githubusercontent.com/12522207/147727927-14ad766c-0ee3-4528-99b4-4434fc4a0a75.png) We selected 512 as the threshold as mentioned above according to the test results. Pull Request resolved: https://github.com/pytorch/pytorch/pull/70520 Approved by: https://github.com/vkuzo

Author

Xia-Weiwen

Committer

pytorchmergebot

Parents

94090150

pytorch c1795977 - Add native impl for group norm on quantized CPU for channels-last inputs (#70520)

pytorch
c1795977 - Add native impl for group norm on quantized CPU for channels-last inputs (#70520)