Workaround performance bug / memory leak in GOMP (#32875)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/32008
This is similar to CaoZhongZ's patch: the work is dispatched to all OpenMP threads in the team and surplus threads exit early, which effectively scales the number of active threads. I have also restored the `if` clause from before https://github.com/pytorch/pytorch/issues/26963 so that running on 1 thread should still avoid additional synchronisation.
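For context, here is a minimal sketch of the pattern (assuming a hypothetical `parallel_for_sketch` helper; this is not the actual `ParallelOpenMP.h` implementation): the full OpenMP team is always launched, threads beyond the number actually needed return immediately, and the `if` clause skips the parallel region entirely when only one thread would do work.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstdint>
#include <functional>

// Sketch only: launch the whole team (working around GOMP's cost of
// repeatedly resizing it), but let surplus threads exit early.
void parallel_for_sketch(
    int64_t begin, int64_t end, int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& f) {
  const int64_t range = end - begin;
  // Number of threads that are actually useful for this range/grain size.
  const int64_t needed =
      std::min<int64_t>(omp_get_max_threads(),
                        std::max<int64_t>(1, range / grain_size));

  // `if` clause: with a single useful thread, skip the parallel region
  // (and its synchronisation) altogether.
  #pragma omp parallel if (needed > 1)
  {
    const int tid = omp_get_thread_num();
    // Early exit: the whole team is launched, but threads beyond
    // `needed` do no work.
    if (tid < needed) {
      const int64_t chunk = (range + needed - 1) / needed;
      const int64_t chunk_begin = begin + tid * chunk;
      const int64_t chunk_end = std::min(end, chunk_begin + chunk);
      if (chunk_begin < chunk_end) {
        f(chunk_begin, chunk_end);
      }
    }
  }
}
```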
One note: this slightly changes the meaning of `at::get_num_threads` inside a `parallel_for` loop, since it is no longer guaranteed that the loop body was actually invoked on that many threads. I've looked at the uses within ATen and couldn't see anything that would be problematic. There are a few places in `quantized` that seem to make this assumption, but they always use a grain size of 1 so they should be safe:
https://github.com/pytorch/pytorch/blob/d9e99ab544cceaf346605db1af4a862197a107cd/aten/src/ATen/native/quantized/cpu/qconv.cpp#L436-L437
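To illustrate the kind of usage referred to above (a hypothetical sketch, not the qconv.cpp code itself): per-thread scratch buffers sized by `at::get_num_threads()` and indexed by `at::get_thread_num()` stay correct with a grain size of 1, because each participating thread only touches its own slot and slots belonging to threads that never ran are simply left untouched.

```cpp
#include <ATen/Parallel.h>
#include <cstdint>
#include <vector>

// Hypothetical example of the per-thread-buffer pattern.
void per_thread_accumulate(int64_t num_tasks) {
  // One accumulator slot per possible thread.
  std::vector<int64_t> per_thread_sums(at::get_num_threads(), 0);

  // grain_size == 1: each invocation handles a small range, and every
  // participating thread writes only to its own slot, so it does not
  // matter if fewer than get_num_threads() threads run the lambda.
  at::parallel_for(0, num_tasks, /*grain_size=*/1,
      [&](int64_t begin, int64_t end) {
        const int tid = at::get_thread_num();
        for (int64_t i = begin; i < end; ++i) {
          per_thread_sums[tid] += i;
        }
      });

  // Reduce over all slots; unused slots remain 0.
  int64_t total = 0;
  for (const auto s : per_thread_sums) {
    total += s;
  }
  (void)total;
}
```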
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32875
Differential Revision: D19775823
Pulled By: VitalyFedyunin
fbshipit-source-id: 4f843b78cdb9e2766339590d728923786a00af6d