huggingface_hub
8912ccd5 - output management checkpoints and final model (#630)

Commit

4 years ago

output management checkpoints and final model (#630)

* output management checkpoints and final model

The aim of this PR is to provide options to separate model checkpoints from the final `model.tar.gz` saved to S3.

In [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example, training a model should result in ~500MB, but the job saves all checkpoints, which results in a 12GB `model.tar.gz` training artifact containing every checkpoint, most of them possibly useless for deployment.

If you want to persist checkpoints to S3 while saving only the final model in the `model.tar.gz` training artifact, you can point Hugging Face Trainer’s `output_dir` to `/opt/ml/checkpoints` and, at the end of your script, save a specific model to the model location with `trainer.save_model("/opt/ml/model")`.

Currently, pointing Hugging Face Trainer’s `output_dir` to `/opt/ml/checkpoints` and saving with `trainer.save_model("/opt/ml/model")` does NOT save the final model to S3, resulting in an empty `output` folder in [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example.

It would be useful to add a `Training output management` section to this documentation after `Prepare a :hugging_face: Transformers fine-tuning script`.

* Update docs/sagemaker/train.md

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* Update docs/sagemaker/train.md

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* Enable checkpointing in estimator

* Update docs/sagemaker/train.md

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* Update train.md

replace **so** with **to** an Amazon S3 location.

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
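The split described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code added by the PR: the helper names `resolve_dirs` and `save_final_model` are hypothetical, and reading the model directory from the `SM_MODEL_DIR` environment variable (with `/opt/ml/model` as fallback) is an assumption about how a training script might be set up. The only Trainer call relied on is `trainer.save_model(...)`, which the commit message itself names.

```python
import os

# Container paths named in the commit message.
CHECKPOINT_DIR = "/opt/ml/checkpoints"  # checkpoints written here can be persisted to S3
DEFAULT_MODEL_DIR = "/opt/ml/model"     # contents end up in the final model.tar.gz


def resolve_dirs(env=None):
    """Return (checkpoint_dir, model_dir) for a training script.

    Hypothetical helper: assumes the model directory may be overridden
    through SM_MODEL_DIR and otherwise defaults to /opt/ml/model.
    """
    env = os.environ if env is None else env
    model_dir = env.get("SM_MODEL_DIR", DEFAULT_MODEL_DIR)
    return CHECKPOINT_DIR, model_dir


def save_final_model(trainer, model_dir):
    """Save only the final model to model_dir, keeping checkpoints out of it.

    `trainer` is expected to expose save_model(path), as Hugging Face
    Trainer does.
    """
    trainer.save_model(model_dir)
    return model_dir


# Hypothetical use inside a training script (requires `transformers`):
#   checkpoint_dir, model_dir = resolve_dirs()
#   training_args = TrainingArguments(output_dir=checkpoint_dir)
#   trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
#   trainer.train()
#   save_final_model(trainer, model_dir)
```

For checkpoints to be synced to S3, checkpointing also has to be enabled on the estimator, as the "Enable checkpointing in estimator" bullet notes; in the SageMaker Python SDK this is presumably done through the estimator's `checkpoint_s3_uri` argument.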

References

#630 - output management checkpoints and final model

Author

mlonaws

Parents

27534611
