Fix issue with zero-sized file after merging file on curriculum `map_reduce` (#5106)
In `deepspeed/runtime/data_pipeline/data_sampling/indexed_dataset.py`,
when `merge_file_` is called, the following operation may leave the
copied bytes in Python's write buffer, so the destination file is not
flushed to disk before it is needed:
```
# Concatenate data
with open(data_file_path(another_file), 'rb') as f:
shutil.copyfileobj(f, self._data_file)
```
This leaves `self._data_file` with size zero on disk, which later
triggers the following error (with stack trace):
```
File "~/my_code/deepspeed_trainer.py", line 999, in my_func
data_analyzer.run_reduce()
File "~/my_env/lib/python3.11/site-packages/deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py", line 413, in run_reduce
self.merge_map_results(self.dataset, self.metric_names, self.metric_types, self.save_path,
File "~/my_env/lib/python3.11/site-packages/deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py", line 371, in merge_map_results
index_to_sample = MMapIndexedDataset(index_to_sample_fname, skip_warmup=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/my_env/lib/python3.11/site-packages/deepspeed/runtime/data_pipeline/data_sampling/indexed_dataset.py", line 486, in __init__
self._do_init(path, skip_warmup)
File "~/my_env/lib/python3.11/site-packages/deepspeed/runtime/data_pipeline/data_sampling/indexed_dataset.py", line 502, in _do_init
self._bin_buffer_mmap = np.memmap(data_file_path(self._path), mode='r', order='C')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/my_env/lib/python3.11/site-packages/numpy/core/memmap.py", line 268, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot mmap an empty file
```
This PR fixes the issue by flushing the destination file after the copy
and adding an assert to verify that the concatenation succeeded.
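The shape of the fix can be sketched as follows. This is a minimal, standalone illustration, not the merged patch: `concat_with_flush` and the exact form of the assert are this sketch's own choices.

```python
import os
import shutil
import tempfile

def concat_with_flush(dst_file, src_path):
    """Append src_path's bytes to the already-open dst_file, then flush
    so a subsequent open()/np.memmap of the destination sees the data."""
    expected = dst_file.tell() + os.path.getsize(src_path)
    with open(src_path, 'rb') as f:
        shutil.copyfileobj(f, dst_file)
    # Push Python's userspace buffer to the OS; without this, the file
    # can still appear empty to a reader opened right afterwards.
    dst_file.flush()
    # Sanity-check that the concatenated bytes actually reached the file.
    assert os.fstat(dst_file.fileno()).st_size == expected, \
        "file concatenation did not complete"

# Usage sketch with throwaway files.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, 'src.bin')
    dst = os.path.join(d, 'dst.bin')
    with open(src, 'wb') as f:
        f.write(b'abc')
    with open(dst, 'wb') as out:
        concat_with_flush(out, src)
        # Readable at full size even before `out` is closed.
        assert os.path.getsize(dst) == 3
```

Flushing after the copy is enough here because the later failure came from `np.memmap` reopening the file by path while the writer was still holding it open with buffered, unwritten data.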
deepspeed version: '0.13.2'
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>