fix parallelization detection for CPU foreach_reduced_elt (#15483)
Summary:
This does two things:
(1): Revert #15114, which is incorrect: it completely disables parallelization in this function, because `at::get_num_threads` returns `-1` unless the thread count has been set explicitly.
(2): Fix our (FB-internal) failing tests that #15114 was intended to fix, by still working correctly in a setup where `_OPENMP` is defined and `omp_get_max_threads() > 1`, but `#pragma omp parallel` only launches one thread. I believe such an unusual situation only exists in certain unit tests within FB infra, but we still need it to work.
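For illustration only, here is a minimal sketch of the general idea behind (2). It is not the code in this PR. The helper `chunked_sum` is hypothetical; it assumes the work is split by the team size observed inside the parallel region (`omp_get_num_threads()`) rather than by `omp_get_max_threads()`, so the result stays correct even when the region launches only one thread:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not the PyTorch implementation): sums `data` by
// splitting it across however many threads the parallel region really has.
inline int64_t chunked_sum(const std::vector<int64_t>& data) {
  const int64_t n = static_cast<int64_t>(data.size());
  int64_t total = 0;
#ifdef _OPENMP
#pragma omp parallel
  {
    // Query the team size inside the region. omp_get_max_threads() is only
    // an upper bound; the region may still run with a single thread.
    const int64_t nthreads = omp_get_num_threads();
    const int64_t tid = omp_get_thread_num();
    const int64_t chunk = (n + nthreads - 1) / nthreads;  // ceiling division
    const int64_t begin = std::min(n, tid * chunk);
    const int64_t end = std::min(n, begin + chunk);
    int64_t local = 0;
    for (int64_t i = begin; i < end; ++i) {
      local += data[i];
    }
    // Combine per-thread partial sums without a data race.
#pragma omp atomic
    total += local;
  }
#else
  // Serial fallback when the build has no OpenMP support.
  for (int64_t i = 0; i < n; ++i) {
    total += data[i];
  }
#endif
  return total;
}
```

If the chunks were sized from `omp_get_max_threads()` and the region ran with one thread, only the first chunk would be processed; sizing them from the observed team size avoids that.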
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15483
Differential Revision: D13538940
Pulled By: umanwizard
fbshipit-source-id: a3362c7ac7327ced350d127bb426f82c59e42732