Parallel map step for `DistributedDataAnalyzer` map-reduce (#5291)
- adds multi-process (multi-CPU) parallelism to the `DistributedDataAnalyzer`
map operation (degree of parallelism set with the `num_workers` parameter).
Uses one `SharedMemory` / `Manager` queue per metric, written concurrently
by the worker processes.
- much faster `write_buffer_to_file` in the `DistributedDataAnalyzer` reduce
operation, achieved by copying the output tensor to CPU and detaching it
before writing.
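
For illustration only, the per-metric queue pattern from the first bullet can be sketched with the standard library. This is a minimal sketch, not the code in this PR: `parallel_map`, `map_worker`, and `metric_fns` are hypothetical names, the strided sharding is an assumption, and the `fork` start method is assumed (POSIX only).

```python
import multiprocessing as mp


def map_worker(worker_id, num_workers, samples, metric_fns, queues):
    """Compute every metric on this worker's strided shard of the samples."""
    for idx in range(worker_id, len(samples), num_workers):
        for name, fn in metric_fns.items():
            # One queue per metric; all workers write to it concurrently.
            queues[name].put((idx, fn(samples[idx])))


def parallel_map(samples, metric_fns, num_workers=2):
    """Hypothetical map step: returns {metric_name: [value per sample]}."""
    # Assumption: "fork" start method, so workers inherit their arguments.
    ctx = mp.get_context("fork")
    with ctx.Manager() as manager:
        queues = {name: manager.Queue() for name in metric_fns}
        procs = [
            ctx.Process(
                target=map_worker,
                args=(w, num_workers, samples, metric_fns, queues),
            )
            for w in range(num_workers)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

        results = {}
        for name, queue in queues.items():
            # Each queue holds exactly one (index, value) pair per sample.
            items = [queue.get() for _ in range(len(samples))]
            # Restore sample order, since workers finish in arbitrary order.
            results[name] = [value for _, value in sorted(items)]
    return results
```

Tagging each value with its sample index lets the parent restore a deterministic order regardless of which worker wrote first.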
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>