q_avgpool: Loop over batch dimension inside operators (#66819)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66819
This has a number of different advantages:
- For channels last tensors, DispatchStub overhead is incurred only once.
- For contiguous tensors, parallelization now happens over batch and
  channels, enabling better load balancing between threads.
- `q_scale()` and `q_zero_point()` are no longer called inside a
  parallel region, which is not allowed (see gh-56794).
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D32445352
Pulled By: ngimel
fbshipit-source-id: cd938e886cd5696855eb56a649eaf3bccce35e54