Choose num_threads in parallel_for based on GRAIN_SIZE (#26963)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24080, Continuation of https://github.com/pytorch/pytorch/issues/26886
What soumith said in https://github.com/pytorch/pytorch/pull/26886#issuecomment-535760635 seems plausible
> I wonder if it has to do with `#pragma omp parallel num_threads(num_threads)` which has unintended consequences, where even if `num_threads=1`, entering an omp block inside an omp block results in bad behavior.
I know for a fact that gcc's openmp doesn't start the thread pool when given `num_threads(1)` but it seems clang behaves differently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26963
Differential Revision: D17626981
Pulled By: soumith
fbshipit-source-id: 484ffe6cc172382bb5ff49ce1fceda7eba20a512