shard `pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed` 1->2
Shard `pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed ...` from 1 shard to 2.
Pros:
- It currently takes about 2.6 hours and is the third-longest-running job on pull
- Theoretically minimal overhead
Cons:
- Requires changes to run_test.py, which might introduce correctness issues
Notes:
- Cannot shard further, because one of the test files is responsible for about half of the total run time
Spreadsheet regarding sharding: https://docs.google.com/spreadsheets/d/1BdtVsjRr0Is9LXMNilR02FEdPXNq7zEWl8AmR3ArsLQ/edit#gid=1153012347
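
To illustrate the note about not sharding further, here is a minimal sketch of duration-based sharding. It assumes a greedy longest-first assignment, which is not necessarily what run_test.py does. The file names and timings (in minutes) are hypothetical, chosen so that one file accounts for half of a 2.6-hour total.

```python
import heapq


def shard_by_duration(durations, num_shards):
    """Greedy longest-first assignment (illustrative; not the run_test.py code).

    Each test file, taken from slowest to fastest, is placed on the shard
    that currently has the smallest total duration.
    """
    heap = [(0.0, i) for i in range(num_shards)]  # (total duration, shard index)
    shards = [[] for _ in range(num_shards)]
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, minutes in ordered:
        total, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (total + minutes, idx))
    totals = [sum(durations[n] for n in shard) for shard in shards]
    return shards, totals


# Hypothetical timings: "big_test" takes half of the 156-minute total.
durations = {"big_test": 78.0, "test_a": 30.0, "test_b": 25.0, "test_c": 23.0}

for num_shards in (2, 3):
    shards, totals = shard_by_duration(durations, num_shards)
    # The slowest shard determines the wall-clock time of the job.
    print(num_shards, "shards -> slowest shard:", max(totals), "minutes")
```

Under these assumed timings the slowest shard takes 78 minutes with either 2 or 3 shards, since no shard can finish faster than its single largest test file.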
Test Plan:
<details><summary>expand to see test plan (it's long)</summary>
Tests run by a commit on master (90 tests)
```
2022-05-03T12:45:34.7974184Z Selected tests:
2022-05-03T12:45:34.7974495Z distributed/_shard/sharded_optim/test_sharded_optim
2022-05-03T12:45:34.7974839Z distributed/_shard/sharded_tensor/ops/test_binary_cmp
2022-05-03T12:45:34.7975209Z distributed/_shard/sharded_tensor/ops/test_elementwise_ops
2022-05-03T12:45:34.7975575Z distributed/_shard/sharded_tensor/ops/test_embedding
2022-05-03T12:45:34.7976180Z distributed/_shard/sharded_tensor/ops/test_embedding_bag
2022-05-03T12:45:34.7976802Z distributed/_shard/sharded_tensor/ops/test_init
2022-05-03T12:45:34.7977361Z distributed/_shard/sharded_tensor/ops/test_linear
2022-05-03T12:45:34.7978157Z distributed/_shard/sharded_tensor/ops/test_math_ops
2022-05-03T12:45:34.7978879Z distributed/_shard/sharded_tensor/test_megatron_prototype
2022-05-03T12:45:34.7979594Z distributed/_shard/sharded_tensor/test_sharded_tensor
2022-05-03T12:45:34.7980366Z distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
2022-05-03T12:45:34.7981066Z distributed/_shard/sharding_plan/test_sharding_plan
2022-05-03T12:45:34.7981877Z distributed/_shard/sharding_spec/test_sharding_spec
2022-05-03T12:45:34.7982387Z distributed/_shard/test_partial_tensor
2022-05-03T12:45:34.7982691Z distributed/_shard/test_replicated_tensor
2022-05-03T12:45:34.7982994Z distributed/_shard/test_sharder
2022-05-03T12:45:34.7983280Z distributed/algorithms/test_join
2022-05-03T12:45:34.7983695Z distributed/elastic/events/lib_test
2022-05-03T12:45:34.7983984Z distributed/elastic/metrics/api_test
2022-05-03T12:45:34.7984308Z distributed/elastic/multiprocessing/api_test
2022-05-03T12:45:34.7984624Z distributed/elastic/timer/api_test
2022-05-03T12:45:34.7984924Z distributed/elastic/timer/local_timer_example
2022-05-03T12:45:34.7985254Z distributed/elastic/timer/local_timer_test
2022-05-03T12:45:34.7985575Z distributed/elastic/utils/distributed_test
2022-05-03T12:45:34.7985889Z distributed/elastic/utils/logging_test
2022-05-03T12:45:34.7986176Z distributed/elastic/utils/util_test
2022-05-03T12:45:34.7986492Z distributed/fsdp/test_flatten_params_wrapper
2022-05-03T12:45:34.7986799Z distributed/fsdp/test_fsdp_apply
2022-05-03T12:45:34.7987078Z distributed/fsdp/test_fsdp_checkpoint
2022-05-03T12:45:34.7987388Z distributed/fsdp/test_fsdp_clip_grad_norm
2022-05-03T12:45:34.7987691Z distributed/fsdp/test_fsdp_comm
2022-05-03T12:45:34.7987961Z distributed/fsdp/test_fsdp_core
2022-05-03T12:45:34.7988251Z distributed/fsdp/test_fsdp_exec_order
2022-05-03T12:45:34.7988570Z distributed/fsdp/test_fsdp_freezing_weights
2022-05-03T12:45:34.7988865Z distributed/fsdp/test_fsdp_grad_acc
2022-05-03T12:45:34.7989176Z distributed/fsdp/test_fsdp_ignored_modules
2022-05-03T12:45:34.7989478Z distributed/fsdp/test_fsdp_input
2022-05-03T12:45:34.7989950Z distributed/fsdp/test_fsdp_memory
2022-05-03T12:45:34.7990241Z distributed/fsdp/test_fsdp_meta
2022-05-03T12:45:34.7990640Z distributed/fsdp/test_fsdp_mixed_precision
2022-05-03T12:45:34.7990964Z distributed/fsdp/test_fsdp_multiple_forward
2022-05-03T12:45:34.7991293Z distributed/fsdp/test_fsdp_multiple_wrapping
2022-05-03T12:45:34.7991610Z distributed/fsdp/test_fsdp_optim_state
2022-05-03T12:45:34.7991895Z distributed/fsdp/test_fsdp_overlap
2022-05-03T12:45:34.7992195Z distributed/fsdp/test_fsdp_pure_fp16
2022-05-03T12:45:34.7992500Z distributed/fsdp/test_fsdp_state_dict
2022-05-03T12:45:34.7992818Z distributed/fsdp/test_fsdp_summon_full_params
2022-05-03T12:45:34.7993117Z distributed/fsdp/test_fsdp_traversal
2022-05-03T12:45:34.7993861Z distributed/fsdp/test_fsdp_uneven
2022-05-03T12:45:34.7994181Z distributed/fsdp/test_shard_utils
2022-05-03T12:45:34.7994447Z distributed/fsdp/test_utils
2022-05-03T12:45:34.7994721Z distributed/fsdp/test_wrap
2022-05-03T12:45:34.7995015Z distributed/nn/jit/test_instantiator
2022-05-03T12:45:34.7995328Z distributed/optim/test_zero_redundancy_optimizer
2022-05-03T12:45:34.7995664Z distributed/pipeline/sync/skip/test_api
2022-05-03T12:45:34.7995983Z distributed/pipeline/sync/skip/test_gpipe
2022-05-03T12:45:34.7996315Z distributed/pipeline/sync/skip/test_inspect_skip_layout
2022-05-03T12:45:34.7996652Z distributed/pipeline/sync/skip/test_leak
2022-05-03T12:45:34.7996977Z distributed/pipeline/sync/skip/test_portal
2022-05-03T12:45:34.7997292Z distributed/pipeline/sync/skip/test_stash_pop
2022-05-03T12:45:34.7997623Z distributed/pipeline/sync/skip/test_tracker
2022-05-03T12:45:34.7997968Z distributed/pipeline/sync/skip/test_verify_skippables
2022-05-03T12:45:34.7998301Z distributed/pipeline/sync/test_balance
2022-05-03T12:45:34.7998591Z distributed/pipeline/sync/test_bugs
2022-05-03T12:45:34.7998927Z distributed/pipeline/sync/test_checkpoint
2022-05-03T12:45:34.7999243Z distributed/pipeline/sync/test_copy
2022-05-03T12:45:34.7999557Z distributed/pipeline/sync/test_deferred_batch_norm
2022-05-03T12:45:34.7999896Z distributed/pipeline/sync/test_dependency
2022-05-03T12:45:34.8000215Z distributed/pipeline/sync/test_inplace
2022-05-03T12:45:34.8000516Z distributed/pipeline/sync/test_microbatch
2022-05-03T12:45:34.8000826Z distributed/pipeline/sync/test_phony
2022-05-03T12:45:34.8001130Z distributed/pipeline/sync/test_pipe
2022-05-03T12:45:34.8001424Z distributed/pipeline/sync/test_pipeline
2022-05-03T12:45:34.8001733Z distributed/pipeline/sync/test_stream
2022-05-03T12:45:34.8002055Z distributed/pipeline/sync/test_transparency
2022-05-03T12:45:34.8002353Z distributed/pipeline/sync/test_worker
2022-05-03T12:45:34.8002672Z distributed/rpc/cuda/test_tensorpipe_agent
2022-05-03T12:45:34.8002982Z distributed/rpc/test_faulty_agent
2022-05-03T12:45:34.8003270Z distributed/rpc/test_tensorpipe_agent
2022-05-03T12:45:34.8003568Z distributed/test_c10d_common
2022-05-03T12:45:34.8003839Z distributed/test_c10d_gloo
2022-05-03T12:45:34.8004088Z distributed/test_c10d_nccl
2022-05-03T12:45:34.8004369Z distributed/test_c10d_spawn_gloo
2022-05-03T12:45:34.8004656Z distributed/test_c10d_spawn_nccl
2022-05-03T12:45:34.8004938Z distributed/test_data_parallel
2022-05-03T12:45:34.8005212Z distributed/test_distributed_spawn
2022-05-03T12:45:34.8005496Z distributed/test_launcher
2022-05-03T12:45:34.8005767Z distributed/test_nccl
2022-05-03T12:45:34.8006019Z distributed/test_pg_wrapper
2022-05-03T12:45:34.8006285Z distributed/test_store
```
Tests run on the first distributed shard on this PR (34 tests)
```
2022-05-02T21:26:00.1385256Z Selected tests:
2022-05-02T21:26:00.1385767Z distributed/test_distributed_spawn
2022-05-02T21:26:00.1386403Z distributed/elastic/multiprocessing/api_test
2022-05-02T21:26:00.1387051Z distributed/fsdp/test_fsdp_memory
2022-05-02T21:26:00.1387607Z distributed/fsdp/test_fsdp_ignored_modules
2022-05-02T21:26:00.1388179Z distributed/fsdp/test_fsdp_apply
2022-05-02T21:26:00.1388600Z distributed/_shard/sharded_tensor/ops/test_binary_cmp
2022-05-02T21:26:00.1389181Z distributed/_shard/sharding_spec/test_sharding_spec
2022-05-02T21:26:00.1389545Z distributed/_shard/sharded_tensor/ops/test_linear
2022-05-02T21:26:00.1389878Z distributed/fsdp/test_fsdp_uneven
2022-05-02T21:26:00.1390186Z distributed/fsdp/test_fsdp_multiple_wrapping
2022-05-02T21:26:00.1390526Z distributed/fsdp/test_fsdp_multiple_forward
2022-05-02T21:26:00.1390877Z distributed/_shard/sharded_tensor/ops/test_embedding
2022-05-02T21:26:00.1391219Z distributed/_shard/test_partial_tensor
2022-05-02T21:26:00.1391542Z distributed/_shard/sharded_optim/test_sharded_optim
2022-05-02T21:26:00.1391915Z distributed/_shard/sharded_tensor/ops/test_elementwise_ops
2022-05-02T21:26:00.1392297Z distributed/fsdp/test_flatten_params_wrapper
2022-05-02T21:26:00.1392585Z distributed/fsdp/test_utils
2022-05-02T21:26:00.1392883Z distributed/nn/jit/test_instantiator
2022-05-02T21:26:00.1393167Z distributed/test_nccl
2022-05-02T21:26:00.1393466Z distributed/_shard/sharding_plan/test_sharding_plan
2022-05-02T21:26:00.1393787Z distributed/_shard/test_sharder
2022-05-02T21:26:00.1394085Z distributed/elastic/timer/api_test
2022-05-02T21:26:00.1394383Z distributed/pipeline/sync/skip/test_api
2022-05-02T21:26:00.1394738Z distributed/pipeline/sync/skip/test_inspect_skip_layout
2022-05-02T21:26:00.1395090Z distributed/pipeline/sync/skip/test_portal
2022-05-02T21:26:00.1395424Z distributed/pipeline/sync/skip/test_tracker
2022-05-02T21:26:00.1395935Z distributed/pipeline/sync/test_balance
2022-05-02T21:26:00.1396288Z distributed/pipeline/sync/test_checkpoint
2022-05-02T21:26:00.1396635Z distributed/pipeline/sync/test_deferred_batch_norm
2022-05-02T21:26:00.1396953Z distributed/pipeline/sync/test_inplace
2022-05-02T21:26:00.1397269Z distributed/pipeline/sync/test_phony
2022-05-02T21:26:00.1397587Z distributed/pipeline/sync/test_pipeline
2022-05-02T21:26:00.1397903Z distributed/pipeline/sync/test_transparency
2022-05-02T21:26:00.1398221Z distributed/rpc/test_faulty_agent
```
Tests run on the second distributed shard on this PR (56 tests)
```
2022-05-02T21:26:55.1342892Z Selected tests:
2022-05-02T21:26:55.1343201Z distributed/rpc/cuda/test_tensorpipe_agent
2022-05-02T21:26:55.1343526Z distributed/fsdp/test_fsdp_core
2022-05-02T21:26:55.1343829Z distributed/test_c10d_nccl
2022-05-02T21:26:55.1344089Z distributed/test_c10d_gloo
2022-05-02T21:26:55.1344408Z distributed/fsdp/test_fsdp_summon_full_params
2022-05-02T21:26:55.1344749Z distributed/fsdp/test_fsdp_mixed_precision
2022-05-02T21:26:55.1345085Z distributed/optim/test_zero_redundancy_optimizer
2022-05-02T21:26:55.1345423Z distributed/fsdp/test_fsdp_optim_state
2022-05-02T21:26:55.1345773Z distributed/_shard/sharded_tensor/test_sharded_tensor
2022-05-02T21:26:55.1346088Z distributed/fsdp/test_fsdp_state_dict
2022-05-02T21:26:55.1346379Z distributed/test_store
2022-05-02T21:26:55.1346661Z distributed/test_c10d_spawn_gloo
2022-05-02T21:26:55.1346966Z distributed/test_pg_wrapper
2022-05-02T21:26:55.1347252Z distributed/test_c10d_spawn_nccl
2022-05-02T21:26:55.1347565Z distributed/fsdp/test_fsdp_clip_grad_norm
2022-05-02T21:26:55.1347871Z distributed/fsdp/test_wrap
2022-05-02T21:26:55.1348369Z distributed/fsdp/test_fsdp_grad_acc
2022-05-02T21:26:55.1348679Z distributed/algorithms/test_join
2022-05-02T21:26:55.1349004Z distributed/fsdp/test_fsdp_freezing_weights
2022-05-02T21:26:55.1349305Z distributed/fsdp/test_fsdp_comm
2022-05-02T21:26:55.1349593Z distributed/test_c10d_common
2022-05-02T21:26:55.1349885Z distributed/fsdp/test_fsdp_meta
2022-05-02T21:26:55.1350171Z distributed/fsdp/test_fsdp_exec_order
2022-05-02T21:26:55.1350486Z distributed/fsdp/test_fsdp_checkpoint
2022-05-02T21:26:55.1350798Z distributed/fsdp/test_fsdp_overlap
2022-05-02T21:26:55.1351105Z distributed/elastic/timer/local_timer_example
2022-05-02T21:26:55.1351423Z distributed/fsdp/test_fsdp_input
2022-05-02T21:26:55.1351749Z distributed/_shard/sharded_tensor/ops/test_init
2022-05-02T21:26:55.1352190Z distributed/elastic/timer/local_timer_test
2022-05-02T21:26:55.1352520Z distributed/elastic/utils/distributed_test
2022-05-02T21:26:55.1352841Z distributed/fsdp/test_fsdp_pure_fp16
2022-05-02T21:26:55.1353150Z distributed/test_data_parallel
2022-05-02T21:26:55.1353437Z distributed/fsdp/test_fsdp_traversal
2022-05-02T21:26:55.1353792Z distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
2022-05-02T21:26:55.1354174Z distributed/_shard/sharded_tensor/ops/test_embedding_bag
2022-05-02T21:26:55.1354534Z distributed/_shard/sharded_tensor/test_megatron_prototype
2022-05-02T21:26:55.1354858Z distributed/test_launcher
2022-05-02T21:26:55.1355149Z distributed/elastic/utils/util_test
2022-05-02T21:26:55.1355441Z distributed/elastic/utils/logging_test
2022-05-02T21:26:55.1355755Z distributed/elastic/metrics/api_test
2022-05-02T21:26:55.1356095Z distributed/_shard/sharded_tensor/ops/test_math_ops
2022-05-02T21:26:55.1356455Z distributed/_shard/test_replicated_tensor
2022-05-02T21:26:55.1356754Z distributed/elastic/events/lib_test
2022-05-02T21:26:55.1357065Z distributed/fsdp/test_shard_utils
2022-05-02T21:26:55.1357387Z distributed/pipeline/sync/skip/test_gpipe
2022-05-02T21:26:55.1357702Z distributed/pipeline/sync/skip/test_leak
2022-05-02T21:26:55.1358040Z distributed/pipeline/sync/skip/test_stash_pop
2022-05-02T21:26:55.1358396Z distributed/pipeline/sync/skip/test_verify_skippables
2022-05-02T21:26:55.1358716Z distributed/pipeline/sync/test_bugs
2022-05-02T21:26:55.1359027Z distributed/pipeline/sync/test_copy
2022-05-02T21:26:55.1359350Z distributed/pipeline/sync/test_dependency
2022-05-02T21:26:55.1359662Z distributed/pipeline/sync/test_microbatch
2022-05-02T21:26:55.1359983Z distributed/pipeline/sync/test_pipe
2022-05-02T21:26:55.1360299Z distributed/pipeline/sync/test_stream
2022-05-02T21:26:55.1360593Z distributed/pipeline/sync/test_worker
2022-05-02T21:26:55.1360912Z distributed/rpc/test_tensorpipe_agent
```
</details>
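
The lists in the test plan can be compared mechanically rather than by eye. The sketch below is one way to do it, assuming each log line has the form `<timestamp> <test name>`. The helper names `parse_selected` and `check_partition` are hypothetical, and the sample logs are short excerpts of the lines above, not the full lists.

```python
def parse_selected(log_text):
    """Extract test names from a 'Selected tests:' log excerpt."""
    tests = []
    for line in log_text.strip().splitlines():
        parts = line.split(maxsplit=1)  # [timestamp, remainder]
        if len(parts) == 2 and parts[1] != "Selected tests:":
            tests.append(parts[1].strip())
    return tests


def check_partition(full, shards):
    """Return True if the shards are disjoint and together equal `full`."""
    seen = [t for shard in shards for t in shard]
    duplicates = {t for t in seen if seen.count(t) > 1}
    missing = set(full) - set(seen)
    extra = set(seen) - set(full)
    return not (duplicates or missing or extra)


master_log = """
2022-05-03T12:45:34.7974184Z Selected tests:
2022-05-03T12:45:34.7983280Z distributed/algorithms/test_join
2022-05-03T12:45:34.7986799Z distributed/fsdp/test_fsdp_apply
2022-05-03T12:45:34.8006285Z distributed/test_store
"""
shard1_log = """
2022-05-02T21:26:00.1385256Z Selected tests:
2022-05-02T21:26:00.1388179Z distributed/fsdp/test_fsdp_apply
"""
shard2_log = """
2022-05-02T21:26:55.1342892Z Selected tests:
2022-05-02T21:26:55.1346379Z distributed/test_store
2022-05-02T21:26:55.1348679Z distributed/algorithms/test_join
"""

full = parse_selected(master_log)
shards = [parse_selected(shard1_log), parse_selected(shard2_log)]
print("shards partition the master list:", check_partition(full, shards))
```

Feeding the three complete logs through the same functions would show whether the 34 + 56 tests on the two shards match the 90 tests run on master.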
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76564
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99