Use counter instead of vector of futures in `_parallel_run` (#36159)
Summary:
This should be faster than allocating one mutex, flag, and condition variable per task.
Using `std::atomic<size_t>` to count the remaining tasks is not sufficient,
because decrementing the remaining counter and signalling the condition variable must happen atomically with respect to the waiting thread;
otherwise `wait()` might get invoked after `notify_one()` was called, and the wakeup would be lost.
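
The pattern described above can be sketched as follows. This is a minimal illustration only, not the code from this PR: `TaskCounter` and `run_all` are hypothetical names, and each task gets its own `std::thread` here purely to keep the example self-contained, which is an assumption and not how `_parallel_run` schedules work. The point is that the counter is decremented and the waiter is notified while holding the same mutex the waiter uses to check the counter, so the notification cannot slip in between the waiter's check and its call to `wait()`.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: one mutex, one condition variable and one counter
// shared by all tasks, instead of one future per task.
class TaskCounter {
 public:
  explicit TaskCounter(std::size_t n) : remaining_(n) {}

  // Called by a worker when its task has finished.
  // The decrement and the notification both happen under mutex_.
  void done() {
    std::lock_guard<std::mutex> lock(mutex_);
    assert(remaining_ > 0);
    if (--remaining_ == 0) {
      cv_.notify_one();
    }
  }

  // Called by the launching thread; returns once every task called done().
  // The predicate is evaluated with mutex_ held, so a worker cannot
  // decrement and notify between the check and the actual sleep.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return remaining_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t remaining_;
};

// Runs each task on its own thread (an assumption made for this sketch),
// blocks until all of them have finished, and returns the number of tasks.
std::size_t run_all(const std::vector<std::function<void()>>& tasks) {
  TaskCounter counter(tasks.size());
  std::vector<std::thread> threads;
  threads.reserve(tasks.size());
  for (const auto& task : tasks) {
    threads.emplace_back([&counter, &task] {
      task();
      counter.done();
    });
  }
  counter.wait();  // all tasks are complete once this returns
  for (auto& t : threads) {
    t.join();
  }
  return tasks.size();
}
```

With a bare `std::atomic<size_t>` and no mutex around the decrement, the waiter could observe a non-zero count, the last worker could then decrement to zero and call `notify_one()`, and only afterwards would the waiter enter `wait()` and sleep with nobody left to wake it.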
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36159
Test Plan: CI
Differential Revision: D20905411
Pulled By: malfet
fbshipit-source-id: facaf599693649c3f43edafc49f369e90d2f60de