pytorch
8a1dab3d - [tsm] add support for jetter to Role (base_image) for mast launches

Commit
3 years ago
[tsm] add support for jetter to Role (base_image) for mast launches

Summary:
1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:

## Local Testing

```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags -o -d . jetter:prod
fbpkg fetch --symlink-tags -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
  ~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing

```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
  pytorch/elastic/examples/distributed_sum/fb/main.py
```

Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex, yifuwang

Differential Revision: D28177553

fbshipit-source-id: 29daada4bc26e5ef0949bf75215f35e557bd35b8
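
The commit message does not spell out what the `--run_path` option does. As a rough illustration only, assuming it makes the launcher execute the training script in the current interpreter through the standard library's `runpy.run_path` (an assumption, not something stated in this commit), the mechanism might look like the sketch below. The helper `run_script_in_process` and the throwaway `main.py` are hypothetical names invented for this example.

```python
import os
import runpy
import sys
import tempfile


def run_script_in_process(script_path, args):
    """Hypothetical helper: run a script as __main__ inside this interpreter.

    Assumption: this mirrors what a `--run_path`-style option might do;
    it is not the actual torch.distributed.run implementation.
    """
    if not os.path.isabs(script_path):
        raise ValueError("script_path must be an absolute path")
    saved_argv = sys.argv
    # The script sees its own path followed by its arguments in sys.argv.
    sys.argv = [script_path, *args]
    try:
        # run_path returns the globals dict of the executed script.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        # Restore the caller's argv whether or not the script raised.
        sys.argv = saved_argv


# Write a throwaway script that sums its integer arguments, then run it.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "main.py")
    with open(script, "w") as f:
        f.write("import sys\nresult = sum(int(a) for a in sys.argv[1:])\n")
    namespace = run_script_in_process(script, ["1", "2", "3"])
    total = namespace["result"]
```

Running the script in-process avoids spawning a separate Python interpreter, which can matter when the entrypoint is a packaged binary rather than a bare `python` executable.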
Author
Kiuk Chung