pytorch
5c48ff20 - AsyncCollectiveTensor: dont sync on view ops (#105240)

Commit View On GitHub

Commit

1 year ago

AsyncCollectiveTensor: dont sync on view ops (#105240) AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives API's. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used. Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: https://github.com/pytorch/pytorch/blob/1518d5eec4425b74b6114d18dd8998df20d0fb0a/torch/distributed/_tensor/api.py#L207) AsyncCollectiveTensor shouldn't need to do a synchronization if you try to detach() it though - in fact, it should be fine to avoid synchronizing if you perform any view ops on it (which just require viewing metadata, but not actual data). This PR tries to update `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op. Added some light testing, that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab

Author

wanchaol

Committer

pytorchmergebot

Parents

e1659388

pytorch 5c48ff20 - AsyncCollectiveTensor: dont sync on view ops (#105240)

Commit

pytorch
5c48ff20 - AsyncCollectiveTensor: dont sync on view ops (#105240)