Solve tensor vs numpy dtype conflicts in data efficiency map-reduce. (#5108)
The map-reduce types are a mess. By looking at the file
`deepspeed/runtime/data_pipeline/data_sampling/indexed_dataset.py`, we
see that the reduce only accepts numpy types due to the following check:
```
dtypes = {
    1: np.uint8,
    2: np.int8,
    3: np.int16,
    4: np.int32,
    5: np.int64,
    6: np.float64,
    7: np.double,
    8: np.uint16,
    9: np.uint32,
    10: np.uint64
}

def code(dtype):
    for k in dtypes.keys():
        if dtypes[k] == dtype:
            return k
    raise ValueError(dtype)
```
Now the issue is that Python/numpy and torch types are not equal (in
Python) even when they describe the same underlying representation:
```
> type(int) == type(np.int64)
True
> type(torch.int64) == type(np.int64)
False
```
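As a consequence, a torch dtype never matches the lookup above. A minimal reproduction (using an abridged copy of the `dtypes`/`code` definitions from `indexed_dataset.py`):

```python
import numpy as np
import torch

# Abridged copy of the numpy-only mapping in indexed_dataset.py
dtypes = {4: np.int32, 5: np.int64, 6: np.float64}

def code(dtype):
    for k in dtypes.keys():
        if dtypes[k] == dtype:
            return k
    raise ValueError(dtype)

print(code(np.int64))   # numpy dtypes are found (prints 5)
try:
    code(torch.int64)   # torch.int64 == np.int64 is False, so no key matches
except ValueError as e:
    print("ValueError:", e)
```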
And the user-specified `metric_function` needs to return a tensor, so it
will automatically carry a torch dtype. If the user does not return a
tensor, then this fails:
In `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py`:
```
def update_metric_results(self, data, metric_types, metric_dtypes, metric_functions, metric_results):
    for m_idx in range(len(metric_types)):
        [...]
        if metric_type == 'single_value_per_sample':
            metric_values = metric_function(data)
            for row in range(metric_values.size()[0]):
```
Only a `torch.Tensor` has a callable `.size()` method; on a numpy array,
`.size` is a plain integer attribute:
```
> np.array([1,2,3]).size()
TypeError: 'int' object is not callable
> torch.tensor([1,2,3]).size()
torch.Size([3])
```
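For comparison, a dtype-agnostic alternative (just a sketch, not what the current code does) would be `.shape[0]` or `len()`, which both `np.ndarray` and `torch.Tensor` support:

```python
import numpy as np
import torch

for values in (np.array([1, 2, 3]), torch.tensor([1, 2, 3])):
    # .shape[0] and len() behave identically for both array types,
    # unlike the torch-only .size() method
    assert values.shape[0] == 3
    assert len(values) == 3
```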
So to my understanding: the user must create a `DataAnalyzer` with
`metric_dtypes` containing numpy dtypes, yet needs to provide a
`metric_function` that returns a torch dtype which **must match
the same data type as numpy**, e.g.
```
def metric_functions(int_list):
    return torch.tensor(int_list).to(torch.int64)  # <-- TORCH dtype required here

data_analyzer = DataAnalyzer(
    dataset=train_dataset,
    metric_names=["seqlen"],
    metric_functions=[metric_functions],
    metric_types=['single_value_per_sample'],
    metric_dtypes=[np.int64],  # <--- NUMPY dtype required here
)
```
Finally, there is no datatype check, so if a user forgets to add
`.to(torch.int64)` in the `metric_function`, the files output by the
worker threads will be named e.g.
`seqlen/worker0_thread0/seqlen_metric_to_sample_730.0.csv`, as the
integer `730` defaults to a `float`. The reduce step would later fail,
as it looks for
`seqlen/worker0_thread0/seqlen_metric_to_sample_730.csv` instead.
This PR adds support for both `np.ndarray` and `torch.Tensor` return
types from `metric_function`. When the function returns a tensor, its
dtype is converted to the corresponding numpy dtype before the results
are written. It also adds several `assert`s to make sure the user
provides the correct return type and dtype in `metric_function` and
`metric_dtypes`, respectively.
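A minimal sketch of the conversion and check idea (the helper name here is hypothetical, not the PR's actual code): map the torch dtype of the returned tensor to its matching numpy dtype, then assert early that it agrees with the declared `metric_dtypes` entry:

```python
import numpy as np
import torch

def torch_dtype_to_numpy(torch_dtype):
    # Hypothetical helper: round-trip an empty CPU tensor through
    # .numpy() to obtain the equivalent numpy dtype
    return torch.empty(0, dtype=torch_dtype).numpy().dtype

metric_values = torch.tensor([730], dtype=torch.int64)  # metric_function output
metric_dtype = np.int64                                 # declared in metric_dtypes

# Early check, along the lines of the asserts added by this PR
assert torch_dtype_to_numpy(metric_values.dtype) == metric_dtype, \
    "metric_function must return the dtype declared in metric_dtypes"
```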
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>