DeepSpeed
d9e12d3a - Fix attention mask handling in the Hybrid Engine Bloom flow (#5101)

The Bloom flow in Hybrid Engine applies the same transformation to the input mask that is already performed earlier by transformers' BloomModel::forward. This results in non-convergence of scores, specifically in DeepSpeed Chat, on different accelerators including CUDA and HPU. The fix removes the redundant mask transformation and application, producing correct convergence.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
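
For illustration, below is a minimal sketch of the double-transformation problem, not the actual DeepSpeed or transformers code: the helper name, tensor shapes, and values are assumptions, and only PyTorch is required. transformers' BloomModel::forward already converts the 0/1 padding mask into an additive attention bias, so converting it a second time in the Hybrid Engine attention path corrupts the bias; the fix is to apply the already-converted mask as-is.

```python
import torch

def bool_to_additive(mask: torch.Tensor) -> torch.Tensor:
    """Convert a 0/1 padding mask into an additive attention bias:
    0.0 where a token may be attended to, a large negative value otherwise."""
    min_val = torch.finfo(torch.float32).min
    return torch.where(mask.bool(),
                       torch.zeros_like(mask, dtype=torch.float32),
                       torch.full_like(mask, min_val, dtype=torch.float32))

# Upstream step (conceptually what BloomModel.forward already does):
padding_mask = torch.tensor([[1, 1, 1, 0]])     # 1 = real token, 0 = padding
additive_mask = bool_to_additive(padding_mask)  # [[0, 0, 0, -3.4e38]]

# Buggy Hybrid Engine flow: the mask is converted a second time. The additive
# bias is reinterpreted as a 0/1 mask, so the result is inverted: real tokens
# are blocked and the padding position becomes attendable.
double_converted = bool_to_additive(additive_mask)
print(double_converted)  # [[-3.4e38, -3.4e38, -3.4e38, 0.0]]

# Fixed flow: the already-additive mask is applied once, unchanged.
attn_scores = torch.zeros(1, 1, 4, 4)           # dummy (batch, head, query, key) scores
correct = attn_scores + additive_mask           # broadcast over the key dimension
```

In this sketch the second conversion stands in for the redundant transformation and application that the commit removes from the Hybrid Engine Bloom flow.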
Files changed
  • deepspeed/module_inject/containers/bloom.py
  • deepspeed/ops/transformer/inference/config.py
  • deepspeed/ops/transformer/inference/ds_attention.py