Remove assumption that padding only occurs on last rank (#6974)
As discussed in
[PR-6918](https://github.com/microsoft/DeepSpeed/pull/6918), padding can
span multiple ranks when the DP degree is large.
For example, with:
- Flattened tensor size: 266240
- DP degree: 768
- Alignment: 1536
- Required padding: 1024 (1536 * 174 - 266240)
- Per-rank partition size: 348 (1536 * 174 / 768)
- The padding spans the last three ranks, since 1024 > 2 * 348.
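
The arithmetic in this example can be checked with a minimal sketch. `padding_per_rank` is a hypothetical helper written for illustration, not a DeepSpeed function. It assumes the padding is appended at the end of the flattened tensor, that the alignment is a multiple of the DP degree, and that ranks are numbered from 0.

```python
def padding_per_rank(numel, dp_degree, alignment):
    """Return (padded_size, partition_size, pad_counts) for a flat tensor.

    Hypothetical helper for illustration only. Assumes padding is appended
    at the end and that `alignment` is a multiple of `dp_degree`.
    """
    # Round the flattened size up to the next multiple of the alignment.
    padded_size = -(-numel // alignment) * alignment
    partition_size = padded_size // dp_degree

    pad_counts = []
    for rank in range(dp_degree):
        start = rank * partition_size
        end = start + partition_size
        # Any index at or beyond `numel` within [start, end) is padding.
        pad_counts.append(max(0, end - max(start, numel)))
    return padded_size, partition_size, pad_counts


padded_size, partition_size, pad_counts = padding_per_rank(266240, 768, 1536)
padded_ranks = [rank for rank, pad in enumerate(pad_counts) if pad > 0]

print(padded_size - 266240)  # 1024
print(partition_size)        # 348
print(padded_ranks)          # [765, 766, 767]
print(pad_counts[-3:])       # [328, 348, 348]
```

The last two ranks hold only padding, and the third-from-last rank holds 20 real elements followed by 328 padding elements.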
This PR removes the assumption that padding is confined to a single
rank, so that cases like the one above are handled.
---------
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>