MaxUnpooling: parallel_for not always backed by OMP (#65655)
Summary:
Use `c10::optional` + `std::atomic_thread_fence` instead of `#pragma omp critical` inside the max_unpooling kernels.
Using any OpenMP pragma inside an `at::parallel_for` body is wrong, as `parallel_for` can be implemented on top of native threading primitives such as pthreads.
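For context, a sketch of the kind of pattern being removed (the loop body, `indices`, `numel`, and `upper_bound` are illustrative stand-ins, not the actual kernel code; `has_error` and `error_index` are the variables mentioned below):

```cpp
bool has_error = false;
int64_t error_index = -1;
// Problematic: the pragma assumes an OpenMP backend, but at::parallel_for
// may be backed by pthreads, where "critical" gives no mutual exclusion.
at::parallel_for(0, numel, 0, [&](int64_t start, int64_t end) {
  for (const auto i : c10::irange(start, end)) {
    if (indices[i] < 0 || indices[i] >= upper_bound) {
#pragma omp critical
      {
        has_error = true;
        error_index = i;
      }
    }
  }
});
```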
`c10::optional` is a much better fit than the pair of `has_error` and `error_index` variables: an unset optional already means "no error". Use `std::atomic_thread_fence` to ensure the `error_index` value is synchronized across threads.
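A minimal sketch of the resulting pattern (the function name `check_indices` and its parameters are hypothetical, used here only to illustrate the technique):

```cpp
#include <ATen/Parallel.h>
#include <c10/util/Exception.h>
#include <c10/util/Optional.h>
#include <c10/util/irange.h>
#include <atomic>

// Hypothetical kernel skeleton: validate indices in parallel and report an
// out-of-range one after the fact, without any OpenMP-specific pragma.
void check_indices(const int64_t* indices, int64_t numel, int64_t upper_bound) {
  c10::optional<int64_t> error_index;
  at::parallel_for(0, numel, 0, [&](int64_t start, int64_t end) {
    for (const auto i : c10::irange(start, end)) {
      if (indices[i] < 0 || indices[i] >= upper_bound) {
        // Record the offending position; the unset state doubles as `has_error`.
        error_index = i;
        // Publish the write so the check after parallel_for observes it.
        std::atomic_thread_fence(std::memory_order_release);
        return;
      }
    }
  });
  if (error_index) {
    std::atomic_thread_fence(std::memory_order_acquire);
    TORCH_CHECK(false, "Found an invalid max index: ", indices[*error_index]);
  }
}
```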
It also fixes an ICE (internal compiler error) reported in https://github.com/pytorch/pytorch/issues/65578.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65655
Reviewed By: ngimel
Differential Revision: D31206501
Pulled By: malfet
fbshipit-source-id: 93df34530e721777b69509cd6c68f5d713fb2b2a