Fix DeepCompile+Z3 on PyTorch v2.9/2.10 (#7951)
DeepCompile+Z3 didn't work with PyTorch v2.9/2.10 because:
- PyTorch v2.9+ started enforcing stricter TorchDynamo parameter
tensor-match guards. During DeepCompile tracing, some ZeRO-3 parameters
were temporarily all-gathered, so Dynamo recorded full sizes such as 4096.
- By the time guard evaluation ran, DeepSpeed had already released those
params back to the normal ZeRO-3 partitioned representation, where
`param.data` is `empty(0)`. That produced guard failures like `expected
4096, actual 0`.
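The mismatch above can be illustrated with a minimal sketch. Plain Python stand-ins are used for tensors; the class and function names here are illustrative, not DeepSpeed's actual API:

```python
# Sketch of the guard mismatch: a size recorded while the parameter is
# gathered no longer matches once the parameter is released.

class FakeParam:
    """Stands in for a ZeRO-3 parameter whose .data is swapped."""
    def __init__(self):
        self.data_numel = 0  # released (partitioned) state: empty(0)

def all_gather(p, full_numel=4096):
    # During DeepCompile tracing the param is temporarily gathered,
    # so its observed size is the full shape.
    p.data_numel = full_numel

def release(p):
    # DeepSpeed then releases the param back to the partitioned
    # representation, where param.data is empty(0).
    p.data_numel = 0

p = FakeParam()
all_gather(p)
recorded = p.data_numel   # Dynamo's guard records 4096 here
release(p)
observed = p.data_numel   # guard evaluation later sees 0
print(f"expected {recorded}, actual {observed}")  # → expected 4096, actual 0
```

This is why overriding the guard metadata to the stable released representation (rather than the transient gathered size) makes the guard evaluate consistently.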
This PR resolves the issue by:
- Keep full-shape dummy tensors for symbolic tracing
- Override guard size/stride metadata for ZeRO-3 params to the stable
released representation instead of transient gathered sizes
This PR also includes fixes for the following bugs:
- For v2.7 and v2.8, the compiled backward graph could hoist
`end_backward` ahead of the real `reduce_grad` calls.
- The selective unsharding pass could overcount the persistence memory budget.
Note: DeepCompile is still incompatible with v2.11. It will be addressed
by another PR.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>