remove mandatory `index` key from output of `metric_function` in `DataAnalysis` map operation (#5112)
When performing the map operation required for curriculum learning,
the output of `metric_function` is expected to contain an `index` field:
```
def update_metric_results(self, data, metric_types, metric_dtypes, metric_functions, metric_results):
for m_idx in range(len(metric_types)):
[...]
if metric_type == 'single_value_per_sample':
for row in range(metric_values.size()[0]):
metric_result["sample_to_metric_builder"].add_item(metric_values[row].reshape(-1))
metric_result["metric_to_sample_dict"][metric_values[row].item()].append(
data['index'][row][0].item()). ##<------- data['index']??
```
This requirement is not mentioned in the documentation, which should
specify that the output of `metric_function` must be a dict/DataFrame (?)
with an `index` key/column. To make things worse, there is no way for a
user to specify a proper `index` value for each sample, because the
distribution of samples across workers/threads is not known to the user;
it is decided inside `DataAnalysis`:
```
def run_map_helper(self, thread_id):
start_idx, end_idx = self.thread_splits[thread_id][0], \
self.thread_splits[thread_id][1]
logger.info(f"worker {self.worker_id} thread {thread_id}: start working " \
f"on data subset {start_idx} to {end_idx}")
thread_dataset = Subset(self.dataset, list(range(start_idx, end_idx)))
sampler = BatchSampler(SequentialSampler(thread_dataset), batch_size=self.batch_size, drop_last=False)
```
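For illustration, the `SequentialSampler`/`BatchSampler` combination above is fully deterministic: the batch sampler yields local subset positions in order, so each local position maps back to a global dataset index by a fixed offset. A minimal sketch with hypothetical split values (not DeepSpeed code):

```python
from torch.utils.data import BatchSampler, SequentialSampler

# Hypothetical thread split, mirroring run_map_helper above.
start_idx, end_idx, batch_size = 100, 110, 4
thread_subset = list(range(start_idx, end_idx))  # stands in for Subset(self.dataset, ...)

# BatchSampler over SequentialSampler yields local indices 0..len-1, in order.
batches = list(BatchSampler(SequentialSampler(thread_subset),
                            batch_size=batch_size, drop_last=False))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# The global dataset index of each sample is just the thread offset plus the local index.
global_ids = [[start_idx + i for i in batch] for batch in batches]
# global_ids == [[100, 101, 102, 103], [104, 105, 106, 107], [108, 109]]
```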
Since a `SequentialSampler` is chosen by design, the global id of every
sample of every batch of every thread of every worker is known
beforehand by looking at
```
self.worker_splits, self.thread_splits = split_dataset(self.dataset, self.num_workers, self.worker_id,
self.num_threads)
start_idx, end_idx = thread_splits[t_idx_reduce][0], thread_splits[t_idx_reduce][1]
```
and the index value can be populated automatically, instead of asking
the user to provide it.
This PR removes the need for the `'index'` key in `data` and instead
uses the batch, thread, and worker ids to compute the global index of
each sample.
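Under that design, the fix reduces to pure arithmetic on the split bounds and batch position. The helper below is a hypothetical sketch (name and signature are assumptions, not the PR's actual code):

```python
def compute_global_indices(start_idx, end_idx, batch_size, batch_idx):
    """Global dataset indices of the samples in batch `batch_idx` of the
    thread that owns the half-open range [start_idx, end_idx), assuming a
    SequentialSampler and BatchSampler(..., drop_last=False) as above."""
    batch_start = start_idx + batch_idx * batch_size
    batch_end = min(batch_start + batch_size, end_idx)  # last batch may be short
    return list(range(batch_start, batch_end))
```

For example, for a thread owning samples 100 to 109 with `batch_size=4`, batch 2 maps to global indices `[108, 109]`, which is exactly the value that previously had to be supplied via `data['index']`.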