Add example of how to handle gradient accumulation with cross-entropy (#3193)
* Add cross-entropy example in the gradient accumulation docs
* add example of logs
* correct skeleton code
* replace gather_for_metrics with gather
* batch_size -> per_device_batch_size
* remove main_process_only=True
* add autoregressive example in examples/
* Update docs/source/usage_guides/gradient_accumulation.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* ruff format
* add grad accum test
* update docs
* Update examples/by_feature/gradient_accumulation_for_autoregressive_models.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* update tests
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>