Improve padding util for compile (#7355)
This PR improves `pad_tensors` in `deepspeed/compile/util.py`, which
pads tensors so that all ranks have tensors with the same shape.
Previously, this function only adjusted tensor shapes, but tensor strides
could still differ across ranks, leading to recompilation on only some ranks.
Because DeepCompile inserts communication operators into the graph, such
rank-local recompilation can easily cause the communication collective to hang.
To address this issue, this PR replaces the use of
`torch.nn.functional.pad` with a new approach that ensures consistent
strides and avoids communication issues during distributed operations.
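As a rough illustration of the idea (not the PR's exact implementation), one way to guarantee identical strides on every rank is to allocate a fresh contiguous buffer of the target shape and copy the source tensor into it, rather than relying on `torch.nn.functional.pad`. The helper name `pad_tensor_contiguous` and the target-shape parameter below are hypothetical:

```python
import torch


def pad_tensor_contiguous(t: torch.Tensor, target_shape) -> torch.Tensor:
    # Hypothetical sketch: allocate a contiguous zero-filled buffer of the
    # target shape, so the output has canonical row-major strides on every
    # rank regardless of the input tensor's memory layout.
    out = t.new_zeros(target_shape)
    # Copy the original values into the leading region of each dimension.
    out[tuple(slice(0, s) for s in t.shape)] = t
    return out
```

Even when the input is non-contiguous (e.g. a transposed view), the padded result is contiguous by construction, so all ranks see the same shape *and* the same strides, avoiding guard failures that recompile only a subset of ranks.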
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>