set first_pass on calls from deepspeed to _prepare_one(...) so that i…
Summary
I noticed that my training with the DeepSpeed integration in accelerate had some strange behavior: the number of iterations per epoch didn't go down as I increased the number of GPUs, and the loss wasn't converging as expected.
After a bunch of debugging, I found that `_prepare_deepspeed(...)` doesn't appear to call `_prepare_one(...)` properly: it calls it without setting `first_pass=True`, which means that `_prepare_one(...)` skips wrapping the DataLoaders... defeating the whole point.
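To make the failure mode concrete, here is a toy sketch of a `first_pass`-gated prepare step. This is not accelerate's actual implementation; the helper name, the modulo-based "sharding", and `num_processes=4` are made up for illustration. It just shows why a caller that forgets `first_pass=True` leaves the DataLoader untouched:

```python
# Toy illustration of the failure mode (not accelerate's actual code):
# DataLoaders are only wrapped/sharded when first_pass=True is passed.
import torch
from torch.utils.data import DataLoader, TensorDataset

def prepare_one_sketch(obj, first_pass=False, num_processes=4):
    if first_pass and isinstance(obj, DataLoader):
        # Stand-in for sharding: each process keeps every num_processes-th batch.
        return [batch for i, batch in enumerate(obj) if i % num_processes == 0]
    return obj  # anything else (or a DataLoader on a later pass) passes through

dataset = TensorDataset(torch.arange(64).float())
loader = DataLoader(dataset, batch_size=2)           # 32 batches total

wrapped = prepare_one_sketch(loader, first_pass=True)
skipped = prepare_one_sketch(loader)                 # what the buggy call path did

print(len(wrapped))  # 8  -> batches divided across the 4 "processes"
print(len(skipped))  # 32 -> unchanged, so iterations per epoch never shrink
```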
How I tested

I added logging to my training flow to print out `len(data_loader)` after `accelerator.prepare(...)` is called. I validated that with this fix, the length is divided by the number of processes, as expected.