TensorIteratorReduce: Avoid tensor operations in parallel_for (#58655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58655
Ref gh-56794
The two-pass reduction calls `copy_` and `select` inside a parallel region. The
`copy_` can simply be moved outside of the parallel region, but avoiding the
`select` call is more complicated because it's needed to construct the
`TensorIterator`. Instead, I factor out a `serial_for_each` free function that
takes only pointers and strides, and then manually advance the pointer to the
thread-specific slice of data.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D28735330
Pulled By: ngimel
fbshipit-source-id: 8e096eb5801af9381ebd305e3ae7796a79b86298