output management checkpoints and final model (#630)
* output management checkpoints and final model
The aim of this PR is to provide options for separating model checkpoints from the final `model.tar.gz` artifact saved to S3.
In [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example, the trained model should be ~500MB, but the training job saves every checkpoint to the model directory, which results in a 12GB `model.tar.gz` training artifact. Most of those checkpoints are likely useless for deployment.
If you want to persist checkpoints to S3 while saving only the final model in the `model.tar.gz` training artifact, you can point the Hugging Face Trainer’s `output_dir` to `/opt/ml/checkpoints` and, at the end of your script, save a specific model to the model location with `trainer.save_model("/opt/ml/model")`. Currently, pointing the Trainer’s `output_dir` to `/opt/ml/checkpoints` and saving with `trainer.save_model("/opt/ml/model")` does NOT save the final model to S3, which results in an empty `output` folder in [this](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb) example.
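The directory layout described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked notebook: `resolve_output_dirs`, `save_final_model`, and `checkpoint_estimator_kwargs` are hypothetical helper names, the S3 URI is a placeholder, and the sketch assumes the standard SageMaker `SM_MODEL_DIR` environment variable and the estimator's `checkpoint_s3_uri` / `checkpoint_local_path` parameters.

```python
import os

# Paths inside the SageMaker training container.
CHECKPOINT_DIR = "/opt/ml/checkpoints"  # synced to S3 when checkpointing is enabled
DEFAULT_MODEL_DIR = "/opt/ml/model"     # packed into model.tar.gz when the job ends


def resolve_output_dirs(env=None):
    """Return where checkpoints and the final model should be written.

    Hypothetical helper: SM_MODEL_DIR is read if present, otherwise the
    default container path is used.
    """
    env = os.environ if env is None else env
    return {
        "output_dir": CHECKPOINT_DIR,
        "model_dir": env.get("SM_MODEL_DIR", DEFAULT_MODEL_DIR),
    }


def save_final_model(trainer, model_dir):
    """Save only the final model to the directory packaged as model.tar.gz."""
    trainer.save_model(model_dir)
    return model_dir


def checkpoint_estimator_kwargs(checkpoint_s3_uri):
    """Estimator keyword arguments that enable checkpoint syncing to S3.

    Assumption: these are passed to the estimator in the launching notebook.
    """
    return {
        "checkpoint_s3_uri": checkpoint_s3_uri,
        "checkpoint_local_path": CHECKPOINT_DIR,
    }


# Intended use inside the training script (requires `transformers`):
#   dirs = resolve_output_dirs()
#   args = TrainingArguments(output_dir=dirs["output_dir"], ...)
#   trainer = Trainer(model=model, args=args, ...)
#   trainer.train()
#   save_final_model(trainer, dirs["model_dir"])
```

On the notebook side, something like `checkpoint_estimator_kwargs("s3://<your-bucket>/checkpoints")` would be unpacked into the estimator constructor, so that intermediate checkpoints go to their own S3 prefix while `model.tar.gz` holds only the final model.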
It would be useful to add a `Training output management` section to this documentation after `Prepare a :hugging_face: Transformers fine-tuning script`.
* Update docs/sagemaker/train.md
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
* Update docs/sagemaker/train.md
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
* Enable checkpointing in estimator
* Update docs/sagemaker/train.md
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
* Update train.md
Replace **so** with **to** in "an Amazon S3 location".
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>