[DPER] Introduce barrier operation to force synchronization of threads in async execution (#49322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49322
In some cases async execution might loose dependencies (Alias like ops) or produce suboptimal scheduling when there is an option which parts to schedule first. Example of the later behavior can happen in ModelParallel training where copy can get lower priority compared to the rest of the execution on the given GPU, which will caused other GPUs to starve.
This operator allows to address these issues by introducing extra explicit dependencies between ops.
Test Plan:
Unit-test/
E2E testing in the future diffs.
Reviewed By: xianjiec
Differential Revision: D24933471
fbshipit-source-id: 1668994c7856d73926cde022378a99e1e8db3567