Solve tensor vs numpy dtype conflicts in data efficiency map-reduce. (#5108)
The map-reduce types are a mess. By looking at the file
`deepspeed/runtime/data_pipeline/data_sampling/indexed_dataset.py`, we
see that the reduce only accepts numpy types due to the following check:
```
dtypes = {
    1: np.uint8,
    2: np.int8,
    3: np.int16,
    4: np.int32,
    5: np.int64,
    6: np.float64,
    7: np.double,
    8: np.uint16,
    9: np.uint32,
    10: np.uint64
}

def code(dtype):
    for k in dtypes.keys():
        if dtypes[k] == dtype:
            return k
    raise ValueError(dtype)
```
Now the issue is that Python/numpy and torch types are not equal (in
Python) even when they describe the same underlying representation:
```
> type(int) == type(np.int64)
True
> type(torch.int64) == type(np.int64)
False
```
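As a consequence, a torch dtype never matches the lookup above. A minimal reproduction (using an abridged copy of the `dtypes`/`code` definitions from `indexed_dataset.py`):

```python
import numpy as np
import torch

# Abridged copy of the numpy-only mapping in indexed_dataset.py
dtypes = {4: np.int32, 5: np.int64, 6: np.float64}

def code(dtype):
    for k in dtypes.keys():
        if dtypes[k] == dtype:
            return k
    raise ValueError(dtype)

print(code(np.int64))   # numpy dtypes are found (prints 5)
try:
    code(torch.int64)   # torch.int64 == np.int64 is False, so no key matches
except ValueError as e:
    print("ValueError:", e)
```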
And the user-specified `metric_function` needs to return a tensor, so it
will automatically carry a torch dtype. If the user does not return a
tensor, then this fails:
In `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py`:
```
def update_metric_results(self, data, metric_types, metric_dtypes, metric_functions, metric_results):
    for m_idx in range(len(metric_types)):
        [...]
        if metric_type == 'single_value_per_sample':
            metric_values = metric_function(data)
            for row in range(metric_values.size()[0]):
```
Only a `torch.Tensor` has a callable `.size()` method; on a numpy array,
`.size` is a plain integer attribute:
```
> np.array([1,2,3]).size()
TypeError: 'int' object is not callable
> torch.tensor([1,2,3]).size()
torch.Size([3])
```
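For comparison, a dtype-agnostic alternative (just a sketch, not what the current code does) would be `.shape[0]` or `len()`, which both `np.ndarray` and `torch.Tensor` support:

```python
import numpy as np
import torch

for values in (np.array([1, 2, 3]), torch.tensor([1, 2, 3])):
    # .shape[0] and len() behave identically for both array types,
    # unlike the torch-only .size() method
    assert values.shape[0] == 3
    assert len(values) == 3
```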
So to my understanding: the user must create a `DataAnalyzer` with
`metric_dtypes` containing numpy dtypes, yet needs to provide a
`metric_function` that returns a torch dtype which **must match
the same data type as numpy**, e.g.
```
def metric_functions(int_list):
    return torch.tensor(int_list).to(torch.int64)  # <-- TORCH dtype required here

data_analyzer = DataAnalyzer(
    dataset=train_dataset,
    metric_names=["seqlen"],
    metric_functions=[metric_functions],
    metric_types=['single_value_per_sample'],
    metric_dtypes=[np.int64],  # <--- NUMPY dtype required here
)
```
Finally, there is no datatype check, so if a user forgets to add
`.to(torch.int64)` in the `metric_function`, the files output by the
worker threads will be named e.g.
`seqlen/worker0_thread0/seqlen_metric_to_sample_730.0.csv`, as the
integer `730` defaults to a `float`. The reduce step would later fail,
as it looks for
`seqlen/worker0_thread0/seqlen_metric_to_sample_730.csv` instead.
This PR adds support for both `np.ndarray` and `torch.Tensor` return
types from `metric_function`. When the function returns a tensor, its
dtype is converted to the corresponding numpy dtype before the results
are written. It also adds several `assert`s to make sure the user
provides the correct return type and dtype in `metric_function` and
`metric_dtypes`, respectively.
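A minimal sketch of the conversion and check idea (the helper name here is hypothetical, not the PR's actual code): map the torch dtype of the returned tensor to its matching numpy dtype, then assert early that it agrees with the declared `metric_dtypes` entry:

```python
import numpy as np
import torch

def torch_dtype_to_numpy(torch_dtype):
    # Hypothetical helper: round-trip an empty CPU tensor through
    # .numpy() to obtain the equivalent numpy dtype
    return torch.empty(0, dtype=torch_dtype).numpy().dtype

metric_values = torch.tensor([730], dtype=torch.int64)  # metric_function output
metric_dtype = np.int64                                 # declared in metric_dtypes

# Early check, along the lines of the asserts added by this PR
assert torch_dtype_to_numpy(metric_values.dtype) == metric_dtype, \
    "metric_function must return the dtype declared in metric_dtypes"
```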
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>