[PyTorch] Expose interface to set grain size on tensor iterator (#58949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58949
Currently, the grain size used to parallelize ops is set at the for_each level. That is too deep in the stack: cpu_kernel_vec does not know which op it is running, yet the decision to parallelize should depend on the op type, since non-trivial ops can benefit from threads even when the tensor's element count is modest. This change exposes a grain-size setter at the tensor iterator level, so the operator creating the iterator can control it.
ghstack-source-id: 130947175
Test Plan: CI; more tests to be added.
Reviewed By: ezyang
Differential Revision: D26857523
fbshipit-source-id: 09fc2953061069967caa9c78b010cb1b68fcc6c9