Use aten's GRAIN_SIZE for TH Tensor ops (#28770)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/28198 in my tests on a 24-core AMD Threadripper.
Profiling the benchmark showed that most of the slowdown in https://github.com/pytorch/pytorch/issues/28198 came from `THFloatTensor_fill` not being distributed across threads. Internally it uses `TH_TENSOR_APPLY_CONTIG`, a thin wrapper around `at::parallel_for` that uses `TH_OMP_OVERHEAD_THRESHOLD` (100,000) as the grain size.
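For context, `at::parallel_for` only splits a range across threads when its length exceeds the grain size, so a threshold of 100,000 keeps moderately sized fills on a single thread. A minimal sketch of that behaviour (illustrative only, not the actual `TH_TENSOR_APPLY_CONTIG` expansion):
```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// Illustrative contiguous fill. at::parallel_for only distributes the
// range [0, numel) across threads when numel exceeds grain_size, so with
// grain_size = 100,000 a 50,000-element tensor runs entirely on one thread.
void fill_contig(float* data, int64_t numel, float value, int64_t grain_size) {
  at::parallel_for(0, numel, grain_size, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] = value;
    }
  });
}
```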
Here I've changed it to use `at::internal::GRAIN_SIZE`, which is 32,768, i.e. roughly 1/3 of the old value. I think it makes sense to unify these two values so that any future tuning in `ATen` also applies to `TH`. It's not entirely clear to me what the "uncertain", "ordin" and "hyper" variants are meant to represent, but I've kept them at roughly the same ratio to `TH_OMP_OVERHEAD_THRESHOLD` as before.
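Concretely, the change amounts to defining the TH thresholds in terms of the ATen constant, along these lines (a sketch: the variant macro names mirror those mentioned above, and the divisors here are placeholders rather than the exact values in the PR):
```cpp
#include <ATen/Parallel.h>

// Tie the TH thresholds to ATen's grain size so that future tuning of
// at::internal::GRAIN_SIZE propagates to TH automatically.
// NOTE: the divisors below are illustrative; the PR keeps each variant
// at roughly its old ratio to TH_OMP_OVERHEAD_THRESHOLD.
#define TH_OMP_OVERHEAD_THRESHOLD            at::internal::GRAIN_SIZE
#define UNCERTAIN_TH_OMP_OVERHEAD_THRESHOLD  (at::internal::GRAIN_SIZE / 2)
#define ORDIN_TH_OMP_OVERHEAD_THRESHOLD      (at::internal::GRAIN_SIZE / 4)
#define HYPER_TH_OMP_OVERHEAD_THRESHOLD      (at::internal::GRAIN_SIZE / 16)
```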
Here are the timing results I get on the benchmark from the linked issue:
| Version | Full iteration time | `index_select` | `mm` | `addmm` |
|:----------:|---------------:|-------------:|---------:|---------:|
| master | 3505.85 ms/it | 184.302 ms | 9.520 ms | 8.494 ms |
| no scaling | 3453.18 ms/it | 184.456 ms | 5.810 ms | 5.069 ms |
| this PR | 3453.23 ms/it | 184.526 ms | 5.824 ms | 5.202 ms |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28770
Differential Revision: D18202646
Pulled By: ezyang
fbshipit-source-id: ab30e5ef24e62213f9bd3abace5c6442c75c9854