Cross-layer overlapping for Domino (#7178)
1. Add an implementation of cross-layer communication overlapping to
make communication effectively "free" by hiding it behind computation.
2. Optimize the implementation of communication overlapping within a
transformer layer.
Signed-off-by: Hongwei Chen <hongweichen@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>