Support multi-loop parallel sections, use multi-loop sections in GRU (#5602)
This PR updates the ThreadPool API to support multi-loop parallel sections. As with the OpenMP "parallel" construct, this allows per-loop work to be amortized over a series of loops. For ORT, it also promotes locality between successive loops in the sense that iteration X of one loop will tend to run on the same worker thread as iteration X of preceding loops.
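To make the idea concrete, here is a minimal, self-contained sketch of a multi-loop parallel section. It is not the ORT implementation: `SimpleBarrier`, `LoopBody`, and `RunLoopsInOneSection` are hypothetical names, and it uses plain `std::thread` with a fixed static partition, whereas the description above only says that iterations will *tend* to stay on the same worker. The point is that threads are started once, reused for every loop, and each worker visits the same index range in each loop.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier (C++17; std::barrier could replace it in C++20).
class SimpleBarrier {
 public:
  explicit SimpleBarrier(std::size_t count)
      : count_(count), waiting_(0), generation_(0) {}

  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    const std::size_t gen = generation_;
    if (++waiting_ == count_) {
      // Last arrival releases everyone and resets for the next loop.
      waiting_ = 0;
      ++generation_;
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return generation_ != gen; });
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  const std::size_t count_;
  std::size_t waiting_;
  std::size_t generation_;
};

using LoopBody = std::function<void(std::size_t)>;

// Runs each loop in `loops` over iterations [0, n) with ONE set of threads.
// Thread start-up and join costs are paid once for the whole series, and
// worker `id` handles the same index range in every loop.
void RunLoopsInOneSection(std::size_t num_threads, std::size_t n,
                          const std::vector<LoopBody>& loops) {
  assert(num_threads > 0);
  SimpleBarrier barrier(num_threads);

  auto worker = [&](std::size_t id) {
    // Static partition: the range depends only on the worker ID.
    const std::size_t begin = n * id / num_threads;
    const std::size_t end = n * (id + 1) / num_threads;
    for (const LoopBody& body : loops) {
      for (std::size_t i = begin; i < end; ++i) body(i);
      barrier.Wait();  // loop k must finish everywhere before loop k+1 starts
    }
  };

  std::vector<std::thread> threads;
  for (std::size_t id = 1; id < num_threads; ++id) {
    threads.emplace_back(worker, id);
  }
  worker(0);  // the calling thread takes part as worker 0
  for (auto& t : threads) t.join();
}
```

The barrier between loops stands in for the per-loop synchronization that a real thread pool would provide; a production design would also need to handle work stealing and uneven iteration costs, which this sketch ignores.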
The change was developed while optimizing the implementation of a model that performed better with OpenMP. Profiling indicated that OpenMP was providing lower loop entry/exit costs and that, via OpenMP's static scheduling, it was leading to a lower L2 miss rate in the series of parallel loops used in GRU.
The main changes are:
- Addition of ThreadPool::ParallelSection and underlying support in the modified Eigen thread pool.
- In EigenNonBlockingThreadPool.h, refactoring the RunInParallel method to support two variants: one that takes an existing parallel section object created by the caller, and another (used by default) that creates its own parallel section.
- Simplify ThreadPool::LoopCounter (used by worker threads to claim loop iterations), basing it on an ID supplied by the underlying Eigen thread pool for affinity in a series of loops.
- Fix a possible perf issue where a loop whose iterations are scheduled in batches could be given more threads than there are batches available.
- Use of parallel sections in the GRU operator.
- Additional test cases in threadpool_test.h.
- Additional comments at the top of threadpool.h and EigenNonBlockingThreadPool.h.
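
As an illustration of the batch-related fix listed above, the following sketch caps the number of threads used for a loop at the number of batches. The helper names `NumBatches` and `ThreadsForLoop` and their parameters are hypothetical and are not taken from the ORT sources; the sketch only assumes that iterations are grouped into fixed-size batches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of batches needed to cover `iterations`
// when each batch holds up to `block_size` iterations (rounded up).
std::size_t NumBatches(std::size_t iterations, std::size_t block_size) {
  assert(block_size > 0);
  return (iterations + block_size - 1) / block_size;
}

// Hypothetical helper: never use more threads than there are batches,
// since any extra thread would find no work to claim.
std::size_t ThreadsForLoop(std::size_t available_threads,
                           std::size_t iterations,
                           std::size_t block_size) {
  return std::min(available_threads, NumBatches(iterations, block_size));
}
```

For example, 10 iterations in batches of 4 form 3 batches, so a pool with 8 available threads would use only 3 of them for that loop.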