accelerate
Device agnostic DeepSpeed & FSDP testing #2235
Merged

ji-huazhong commented 1 year ago (edited)

What does this PR do?

Part of #2122
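
For context, the device-agnostic pattern behind #2122 replaces hard-coded `cuda` references in the test suite with whatever accelerator is present at runtime. A minimal sketch of the idea follows; the helper name `get_torch_device` is illustrative, not necessarily the exact utility this PR touches:

```python
# Hedged sketch of device-agnostic test setup: resolve the target device once
# at import time instead of hard-coding "cuda" in every test.
import torch


def get_torch_device() -> str:
    """Return the device string for the first available accelerator."""
    if torch.cuda.is_available():
        return "cuda"
    try:
        import torch_npu  # noqa: F401  # Ascend plugin that registers torch.npu

        if torch.npu.is_available():
            return "npu"
    except ImportError:
        pass
    return "cpu"


torch_device = get_torch_device()

# Tests then move models/tensors to `torch_device` instead of calling .cuda():
model = torch.nn.Linear(4, 4).to(torch_device)
```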

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

cc @muellerzr

ji-huazhong device agnostic deepspeed testing
aa0598c6
ji-huazhong device agnostic fsdp testing
3d6a2a8f
ji-huazhong marked this pull request as draft 1 year ago
ji-huazhong fix failing deepspeed test
3dd8e440
ji-huazhong commented 1 year ago (edited)

Verified on NPU:

  • test_deepspeed.py
(inference) [root@localhost /home/inference/accelerate]# RUN_SLOW=1 python -m pytest -v tests/deepspeed/
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0 -- /root/anaconda3/envs/inference/bin/python
cachedir: .pytest_cache
rootdir: /home/inference/accelerate
collected 20 items

tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_accelerate_state_deepspeed_bf16 PASSED                                                                         [  5%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_accelerate_state_deepspeed_fp16 PASSED                                                                         [ 10%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig PASSED                                                                                       [ 15%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig_from_ds_plugin_bf16 PASSED                                                                   [ 20%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig_from_ds_plugin_fp16 PASSED                                                                   [ 25%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_basic_run PASSED                                                                                               [ 30%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_deepspeed_plugin_zero2 PASSED                                                                                  [ 35%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_deepspeed_plugin_zero3 PASSED                                                                                  [ 40%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_ds_config_assertions PASSED                                                                                    [ 45%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_ds_config_zero2 PASSED                                                                                         [ 50%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_ds_config_zero3 PASSED                                                                                         [ 55%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_init_zero3 PASSED                                                                                              [ 60%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_custom_optimizer_custom_scheduler PASSED                                                        [ 65%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_custom_optimizer_deepspeed_scheduler PASSED                                                  [ 70%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_deepspeed_optimizer_custom_scheduler PASSED                                                  [ 75%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_deepspeed_optimizer_deepspeed_scheduler PASSED                                               [ 80%]
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_save_checkpoints PASSED                                                                                        [ 85%]
tests/deepspeed/test_deepspeed.py::DeepSpeedIntegrationTest::test_checkpointing PASSED                                                                                             [ 90%]
tests/deepspeed/test_deepspeed.py::DeepSpeedIntegrationTest::test_peak_memory_usage PASSED                                                                                         [ 95%]
tests/deepspeed/test_deepspeed.py::DeepSpeedIntegrationTest::test_performance PASSED                                                                                               [100%]

==================================================================================== warnings summary ====================================================================================
../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:121
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:121: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("best_of")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:140
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:140: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("repetition_penalty")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:146
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:146: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("seed")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:152
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:152: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("temperature")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:158
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:158: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("top_k")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:164
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:164: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("top_p")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:170
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:170: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("truncate")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:176
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:176: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("typical_p")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:204
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:204: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("inputs")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:210
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/huggingface_hub/inference/_text_generation.py:210: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.5/migration/
    @validator("stream")

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/torch_npu/dynamo/torchair/__init__.py:2
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/torch_npu/dynamo/torchair/__init__.py:2: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/transformers/utils/import_utils.py:329
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/transformers/utils/import_utils.py:329: FutureWarning: The util is_torch_bf16_available is deprecated, please use is_torch_bf16_gpu_available or is_torch_bf16_cpu_available instead according to whether it's used with cpu or gpu
    warnings.warn(

tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_accelerate_state_deepspeed_bf16
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_accelerate_state_deepspeed_fp16
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig_from_ds_plugin_bf16
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig_from_ds_plugin_fp16
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_deepspeed_plugin_zero2
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_deepspeed_plugin_zero3
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_ds_config_zero2
  /home/inference/accelerate/src/accelerate/utils/dataclasses.py:659: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
    warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")

tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_accelerate_state_deepspeed_bf16
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/deepspeed/comm/comm.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
    utils.logger.warn("HCCL backend in DeepSpeed not yet implemented")

tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig_from_ds_plugin_bf16
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_autofill_dsconfig_from_ds_plugin_fp16
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_custom_optimizer_custom_scheduler
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_custom_optimizer_deepspeed_scheduler
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_deepspeed_optimizer_custom_scheduler
tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_prepare_deepspeed_deepspeed_optimizer_deepspeed_scheduler

tests/deepspeed/test_deepspeed.py::DeepSpeedConfigIntegration::test_init_zero3
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 20 passed, 29 warnings in 1009.75s (0:16:49) ======================================================================
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
Adam Optimizer #1 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1

  • test_fsdp.py
(inference) [root@localhost /home/inference/accelerate]# RUN_SLOW=1 python -m pytest -v  tests/fsdp
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.8.18, pytest-7.4.3, pluggy-1.3.0 -- /root/anaconda3/envs/inference/bin/python
cachedir: .pytest_cache
rootdir: /home/inference/accelerate
collected 9 items

tests/fsdp/test_fsdp.py::FSDPPluginIntegration::test_auto_wrap_policy PASSED                                                                                                       [ 11%]
tests/fsdp/test_fsdp.py::FSDPPluginIntegration::test_backward_prefetch PASSED                                                                                                      [ 22%]
tests/fsdp/test_fsdp.py::FSDPPluginIntegration::test_cpu_offload PASSED                                                                                                            [ 33%]
tests/fsdp/test_fsdp.py::FSDPPluginIntegration::test_mixed_precision PASSED                                                                                                        [ 44%]
tests/fsdp/test_fsdp.py::FSDPPluginIntegration::test_sharding_strategy PASSED                                                                                                      [ 55%]
tests/fsdp/test_fsdp.py::FSDPPluginIntegration::test_state_dict_type PASSED                                                                                                        [ 66%]
tests/fsdp/test_fsdp.py::FSDPIntegrationTest::test_checkpointing FAILED                                                                                                            [ 77%]
tests/fsdp/test_fsdp.py::FSDPIntegrationTest::test_peak_memory_usage PASSED                                                                                                        [ 88%]
tests/fsdp/test_fsdp.py::FSDPIntegrationTest::test_performance PASSED                                                                                                              [100%]

======================================================================================== FAILURES ========================================================================================
_________________________________________________________________________ FSDPIntegrationTest.test_checkpointing _________________________________________________________________________

self = <test_fsdp.FSDPIntegrationTest testMethod=test_checkpointing>

    def test_checkpointing(self):
        self.test_file_path = os.path.join(self.test_scripts_folder, "test_checkpointing.py")
        cmd = [
            "accelerate",
            "launch",
            "--num_processes=2",
            "--num_machines=1",
            "--machine_rank=0",
            "--use_fsdp",
            "--mixed_precision=fp16",
            "--fsdp_transformer_layer_cls_to_wrap=BertLayer",
        ]

        for i, strategy in enumerate(FSDP_SHARDING_STRATEGY):
            cmd_config = cmd.copy()
            cmd_config.append(f"--fsdp_sharding_strategy={i+1}")
            if strategy != "FULL_SHARD":
                continue
            state_dict_config_index = len(cmd_config)
            for state_dict_type in FSDP_STATE_DICT_TYPE:
                # Todo: Currently failing for `LOCAL_STATE_DICT` with error
                # Unexpected key(s) in state_dict: "_fsdp_wrapped_module._flat_param".
                if state_dict_type == "LOCAL_STATE_DICT":
                    continue

                cmd_config = cmd_config[:state_dict_config_index]
                cmd_config.append(f"--fsdp_state_dict_type={state_dict_type}")
                cmd_config.extend(
                    [
                        self.test_file_path,
                        f"--output_dir={self.tmpdir}",
                        "--partial_train_epoch=1",
                    ]
                )
                with patch_environment(omp_num_threads=1):
>                   execute_subprocess_async(cmd_config, env=os.environ.copy())

tests/fsdp/test_fsdp.py:270:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cmd = ['accelerate', 'launch', '--num_processes=2', '--num_machines=1', '--machine_rank=0', '--use_fsdp', ...]
env = {'ASCEND_AICPU_PATH': '/home/inference/ascend-toolkit/latest', 'ASCEND_HOME_PATH': '/home/inference/ascend-toolkit/lat...ATH': '/home/inference/ascend-toolkit/latest/opp', 'ASCEND_TOOLKIT_HOME': '/home/inference/ascend-toolkit/latest', ...}
stdin = None, timeout = 180, quiet = False, echo = True

    def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False, echo=True) -> _RunOutput:
        loop = asyncio.get_event_loop()
        result = loop.run_until_complete(
            _stream_subprocess(cmd, env=env, stdin=stdin, timeout=timeout, quiet=quiet, echo=echo)
        )

        cmd_str = " ".join(cmd)
        if result.returncode > 0:
            stderr = "\n".join(result.stderr)
>           raise RuntimeError(
                f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
                f"The combined stderr from workers follows:\n{stderr}"
            )
E           RuntimeError: 'accelerate launch --num_processes=2 --num_machines=1 --machine_rank=0 --use_fsdp --mixed_precision=fp16 --fsdp_transformer_layer_cls_to_wrap=BertLayer --fsdp_sharding_strategy=1 --fsdp_state_dict_type=SHARDED_STATE_DICT /home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py --output_dir=/tmp/tmpr7c7lgd0 --partial_train_epoch=1' failed with returncode 1
E
E           The combined stderr from workers follows:
E           Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Map: 100%|██████████| 3668/3668 [00:00<00:00, 8476.72 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 7054.42 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 8346.33 examples/s]
Map: 100%|██████████| 3668/3668 [00:00<00:00, 6709.36 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 7915.79 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 9205.41 examples/s]
E           Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
E           You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
E           Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
E           You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
E           You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
E           You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
E           Traceback (most recent call last):
E             File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 269, in <module>
E               main()
E             File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 265, in main
E               training_function(config, args)
E             File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 214, in training_function
E               accelerator.save_state(output_dir)
E             File "/home/inference/accelerate/src/accelerate/accelerator.py", line 2666, in save_state
E               save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
E             File "/home/inference/accelerate/src/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
E               dist_cp.save_state_dict(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
E               central_plan = distW.reduce_scatter("plan", local_step, global_step)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
E               all_data = self.gather_object(local_data)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
E               dist.gather_object(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
E               gather(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
E               work = default_pg.gather(output_tensors, input_tensors, opts)
E           RuntimeError: ProcessGroupHCCL does not support gather
E           Traceback (most recent call last):
E             File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 269, in <module>
E               main()
E             File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 265, in main
E               training_function(config, args)
E             File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 214, in training_function
E               accelerator.save_state(output_dir)
E             File "/home/inference/accelerate/src/accelerate/accelerator.py", line 2666, in save_state
E               save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
E             File "/home/inference/accelerate/src/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
E               dist_cp.save_state_dict(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
E               central_plan = distW.reduce_scatter("plan", local_step, global_step)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
E               all_data = self.gather_object(local_data)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
E               dist.gather_object(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
E               gather(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
E               return func(*args, **kwargs)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
E               work = default_pg.gather(output_tensors, input_tensors, opts)
E           RuntimeError: ProcessGroupHCCL does not support gather
E           [2023-12-11 11:21:12,402] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1184029 closing signal SIGTERM
E           [2023-12-11 11:21:14,218] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1184030) of binary: /root/anaconda3/envs/inference/bin/python3.8
E           Traceback (most recent call last):
E             File "/root/anaconda3/envs/inference/bin/accelerate", line 8, in <module>
E               sys.exit(main())
E             File "/home/inference/accelerate/src/accelerate/commands/accelerate_cli.py", line 47, in main
E               args.func(args)
E             File "/home/inference/accelerate/src/accelerate/commands/launch.py", line 1004, in launch_command
E               multi_gpu_launcher(args)
E             File "/home/inference/accelerate/src/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
E               distrib_run.run(args)
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
E               elastic_launch(
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
E               return launch_agent(self._config, self._entrypoint, list(args))
E             File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
E               raise ChildFailedError(
E           torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
E           ============================================================
E           /home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py FAILED
E           ------------------------------------------------------------
E           Failures:
E             <NO_OTHER_FAILURES>
E           ------------------------------------------------------------
E           Root Cause (first observed failure):
E           [0]:
E             time      : 2023-12-11_11:21:12
E             host      : localhost.localdomain
E             rank      : 1 (local_rank: 1)
E             exitcode  : 1 (pid: 1184030)
E             error_file: <N/A>
E             traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
E           ============================================================
E           /root/anaconda3/envs/inference/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 41 leaked semaphore objects to clean up at shutdown
E             warnings.warn('resource_tracker: There appear to be %d '

src/accelerate/test_utils/testing.py:465: RuntimeError
---------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------

Running:  accelerate launch --num_processes=2 --num_machines=1 --machine_rank=0 --use_fsdp --mixed_precision=fp16 --fsdp_transformer_layer_cls_to_wrap=BertLayer --fsdp_sharding_strategy=1 --fsdp_state_dict_type=FULL_STATE_DICT /home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py --output_dir=/tmp/tmpr7c7lgd0 --partial_train_epoch=1
stdout: ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
stdout: ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
stdout: epoch 0: {'accuracy': 0.7720588235294118, 'lr': 1e-05, 'optimizer_lr': 1e-05, 'epoch': 0, 'step': 115}

Running:  accelerate launch --num_processes=2 --num_machines=1 --machine_rank=0 --use_fsdp --mixed_precision=fp16 --fsdp_transformer_layer_cls_to_wrap=BertLayer --fsdp_sharding_strategy=1 --fsdp_state_dict_type=FULL_STATE_DICT /home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py --output_dir=/tmp/tmpr7c7lgd0 --resume_from_checkpoint=/tmp/tmpr7c7lgd0/epoch_0
stdout: ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
stdout: resumed checkpoint performance: 0.7720588235294118
stdout: resumed checkpoint's scheduler's lr: 1e-05
stdout: resumed optimizers's lr: 1e-05

Running:  accelerate launch --num_processes=2 --num_machines=1 --machine_rank=0 --use_fsdp --mixed_precision=fp16 --fsdp_transformer_layer_cls_to_wrap=BertLayer --fsdp_sharding_strategy=1 --fsdp_state_dict_type=SHARDED_STATE_DICT /home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py --output_dir=/tmp/tmpr7c7lgd0 --partial_train_epoch=1
---------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------
stderr: Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Map: 100%|██████████| 3668/3668 [00:00<00:00, 9167.83 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 7754.91 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 9365.63 examples/s]
Map: 100%|██████████| 3668/3668 [00:00<00:00, 6761.60 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 7297.96 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 8793.45 examples/s]
stderr: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
stderr: You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
stderr: [W ProcessGroupHCCL.cpp:1463] Warning: The current allgather operator has a defect in handling different tensor shape,         the work event forces a wait operation, and the allgather wait on the python side would be fake (function operator())
stderr: [W ProcessGroupHCCL.cpp:1463] Warning: The current allgather operator has a defect in handling different tensor shape,         the work event forces a wait operation, and the allgather wait on the python side would be fake (function operator())
stderr: Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Map: 100%|██████████| 3668/3668 [00:00<00:00, 7585.03 examples/s]
Map: 100%|██████████| 3668/3668 [00:00<00:00, 6720.25 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 5946.37 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 6294.83 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 8027.00 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 8195.88 examples/s]
stderr: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
stderr: You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
stderr: Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Map: 100%|██████████| 3668/3668 [00:00<00:00, 8476.72 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 7054.42 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 8346.33 examples/s]
Map: 100%|██████████| 3668/3668 [00:00<00:00, 6709.36 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 7915.79 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 9205.41 examples/s]
stderr: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
stderr: You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
stderr: You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
stderr: You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
stderr: Traceback (most recent call last):
stderr:   File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 269, in <module>
stderr:     main()
stderr:   File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 265, in main
stderr:     training_function(config, args)
stderr:   File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 214, in training_function
stderr:     accelerator.save_state(output_dir)
stderr:   File "/home/inference/accelerate/src/accelerate/accelerator.py", line 2666, in save_state
stderr:     save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
stderr:   File "/home/inference/accelerate/src/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
stderr:     dist_cp.save_state_dict(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
stderr:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
stderr:     all_data = self.gather_object(local_data)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
stderr:     dist.gather_object(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
stderr:     gather(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
stderr:     work = default_pg.gather(output_tensors, input_tensors, opts)
stderr: RuntimeError: ProcessGroupHCCL does not support gather
stderr: Traceback (most recent call last):
stderr:   File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 269, in <module>
stderr:     main()
stderr:   File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 265, in main
stderr:     training_function(config, args)
stderr:   File "/home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py", line 214, in training_function
stderr:     accelerator.save_state(output_dir)
stderr:   File "/home/inference/accelerate/src/accelerate/accelerator.py", line 2666, in save_state
stderr:     save_fsdp_model(self.state.fsdp_plugin, self, model, output_dir, i)
stderr:   File "/home/inference/accelerate/src/accelerate/utils/fsdp_utils.py", line 72, in save_fsdp_model
stderr:     dist_cp.save_state_dict(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/state_dict_saver.py", line 113, in save_state_dict
stderr:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 177, in reduce_scatter
stderr:     all_data = self.gather_object(local_data)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/checkpoint/utils.py", line 108, in gather_object
stderr:     dist.gather_object(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2509, in gather_object
stderr:     gather(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3078, in gather
stderr:     work = default_pg.gather(output_tensors, input_tensors, opts)
stderr: RuntimeError: ProcessGroupHCCL does not support gather
stderr: [2023-12-11 11:21:12,402] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1184029 closing signal SIGTERM
stderr: [2023-12-11 11:21:14,218] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1184030) of binary: /root/anaconda3/envs/inference/bin/python3.8
stderr: Traceback (most recent call last):
stderr:   File "/root/anaconda3/envs/inference/bin/accelerate", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/inference/accelerate/src/accelerate/commands/accelerate_cli.py", line 47, in main
stderr:     args.func(args)
stderr:   File "/home/inference/accelerate/src/accelerate/commands/launch.py", line 1004, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/home/inference/accelerate/src/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
stderr:     elastic_launch(
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/root/anaconda3/envs/inference/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
stderr:     raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /home/inference/accelerate/src/accelerate/test_utils/scripts/external_deps/test_checkpointing.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr:   <NO_OTHER_FAILURES>
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2023-12-11_11:21:12
stderr:   host      : localhost.localdomain
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 1184030)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
stderr: /root/anaconda3/envs/inference/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 41 leaked semaphore objects to clean up at shutdown
stderr:   warnings.warn('resource_tracker: There appear to be %d '
==================================================================================== warnings summary ====================================================================================
../../../root/anaconda3/envs/inference/lib/python3.8/site-packages/torch_npu/dynamo/torchair/__init__.py:2
  /root/anaconda3/envs/inference/lib/python3.8/site-packages/torch_npu/dynamo/torchair/__init__.py:2: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================ short test summary info =================================================================================
FAILED tests/fsdp/test_fsdp.py::FSDPIntegrationTest::test_checkpointing - RuntimeError: 'accelerate launch --num_processes=2 --num_machines=1 --machine_rank=0 --use_fsdp --mixed_precision=fp16 --fsdp_transformer_layer_cls_to_wrap=BertLayer --fsdp_sharding...
================================================================== 1 failed, 8 passed, 3 warnings in 567.17s (0:09:27) ===================================================================

NPU does not support the `gather` operator yet, which is what makes `FSDPIntegrationTest::test_checkpointing` fail above.
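
For reference, the traceback shows `dist.gather_object` bottoming out in the `gather` collective, which `ProcessGroupHCCL` does not implement. Below is a hedged sketch of one possible fallback, not what this PR does: it substitutes `all_gather_object`, which avoids the `gather` primitive at the cost of every rank receiving all objects.

```python
import torch.distributed as dist


def gather_object_compat(obj, dst=0):
    """Collect `obj` from every rank onto rank `dst`.

    Illustrative fallback for backends (e.g. HCCL) whose process group does
    not implement the gather collective: use all_gather_object instead, then
    discard the result on non-destination ranks.
    """
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, obj)  # supported more widely than gather
    return gathered if dist.get_rank() == dst else None
```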

ji-huazhong marked this pull request as ready for review 1 year ago
muellerzr approved these changes on 2023-12-11
muellerzr commented 1 year ago

Tentatively okay with just a blanket pass: these tests only fail on NPU and we don't have runners for it, so skipping isn't critical here, I think. cc @pacman100
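
If skipping were ever preferred over a blanket pass, a minimal sketch of how the NPU-only failure could be guarded is shown below; it assumes `accelerate.utils.is_npu_available` (present in accelerate at the time), and the decorator placement is purely illustrative:

```python
import unittest

from accelerate.utils import is_npu_available


class FSDPIntegrationTest(unittest.TestCase):
    @unittest.skipIf(
        is_npu_available(),
        "ProcessGroupHCCL does not implement the gather collective yet",
    )
    def test_checkpointing(self):
        ...
```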

muellerzr requested a review from pacman100 1 year ago
HuggingFaceDocBuilderDev commented 1 year ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

muellerzr commented 1 year ago

@statelesshz can you run make style; make quality please?

ji-huazhong make style
c9a13fe9
ji-huazhong commented 1 year ago

The CI is green.

ji-huazhong commented 1 year ago

Hi @pacman100, please take a look at this PR, thanks :-)

pacman100 approved these changes on 2023-12-20
pacman100 commented 1 year ago

Wow! Nice way to extend the tests to more devices, good work @statelesshz!

muellerzr Merge branch 'main' into device-agnostic-testing
cbd558a1
muellerzr merged b565a6c5 into main 1 year ago
