Squashed commit of the following:

Commit

205 days ago

Squashed commit of the following:

commit 2f8fd72e5112beb24082c252f8aa5e621bb10129
Author: Simon <80467011+sorgfresser@users.noreply.github.com>
Date:   Tue Jun 10 13:50:34 2025 +0100

    Remove device_count (#3587)

commit d2e6b0313d696be62fe69d19f15bf3098effbad2
Author: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Date:   Tue Jun 10 05:26:48 2025 -0700

    [FSDP2] Refactor + FP8 (#3585)

    * Fix double wrap
    * Clocking off, ~equal to torch baseline
    * works?
    * Working version
    * Partial rewrite
    * FSDP2 path works
    * Fix back prepare
    * Almost done, proper AC left
    * Feat: should work, cleanup + test more benchmarks left
    * Style+quality
    * Feat: fp8 example
    * Feat: better example
    * Feat: add readme
    * Docs + should be done
    * Fix: typos
    * Fix: protect imports
    * Feat: address comments
    * Feat: add flops image

commit b9fee48c85dc8b3c4db1e97258925660cdc6ee36
Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date:   Tue Jun 10 13:24:43 2025 +0100

    better handle FP8 with and without deepspeed (#3611)

    * use the state mixed precision which has undergone all preprocessing
    * Update src/accelerate/accelerator.py
      Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
    * Update src/accelerate/accelerator.py
    * accelerator state sets the mixed precision for deepspeed and fp8_enabled
    * fix
    * fix
    ---------
    Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

commit 3a82b056cf85b16976ca2760615897fe65ae5e64
Author: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Date:   Tue Jun 10 11:29:59 2025 +0200

    Fix bf16 training with TP (#3610)

    * fix
    * Apply style fixes
    ---------
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 6b61a373a2b4e72e3f003ba2277904ee31b9f7e0
Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date:   Fri Jun 6 13:48:43 2025 +0100

    fix deepspeed regional compilation (#3609)

commit 682691deaca2637e0a2efeaa5ebb6dd8bade8c30
Author: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Date:   Tue Jun 3 12:36:56 2025 +0200

    Update Gaudi Runners (#3593)

    * test
    * fix
    * push
    * in the morning
    * fix backend
    * run first
    * set habana modules
    * dynamo backend
    * trigger
    * remove on pr
    * remove on file change

commit 791055b4848d2c11d3dfcd47faba79b524973756
Author: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Date:   Tue Jun 3 12:24:20 2025 +0200

    Fix: list object has no attribute keys (#3603)

commit 16bf1d89016e03f5b0d8545e9883df7fe9ab9b5f
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Fri May 30 23:36:34 2025 +0800

    enable torchao and pippy test cases on XPU (#3599)

    * enable torchao and pippy test cases on XPU
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * fix style
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    ---------
    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit ab3c604e48619f7cd08cfac46a7c542414b6661f
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Fri May 30 23:23:26 2025 +0800

    enable big_model_inference on xpu (#3595)

    * enable big_model_inference on XPU
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * fix style
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * fix quality
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    ---------
    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit 273799c85d849a1954a4f2e65767216eb37fa089
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 27 20:08:59 2025 +0800

    enable fsdp2 benchmark on XPU (#3590)

    * enable fsdp2 benchmark on XPU
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * add deterministic
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    ---------
    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit 43526c5c089cc831530f42bbbe66a0cb0b0ea461
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 27 17:44:50 2025 +0800

    add device-agnostic GradScaler (#3588)

    * add device-agnostic GradScaler
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * fix bug
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * fix review comments
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * fix
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * format
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * Apply style fixes
    ---------
    Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 07f2392f40a92710b4fb7e51b2de1d40f24d44e2
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 27 17:17:18 2025 +0800

    change to use torch.device (#3594)

    Signed-off-by: Matrix YAO <matrix.yao@intel.com>

commit ee2f48c2c3d393187408a0f2cce1ece973033809
Author: Fanli Lin <fanli.lin@intel.com>
Date:   Tue May 27 17:16:42 2025 +0800

    [docs] no hard-coded cuda in the ddp documentation (#3589)

    * make device-agnostic
    * refactor

commit 4f3abb73a722f6275197c060346dd2f385039afc
Author: jiqing-feng <jiqing.feng@intel.com>
Date:   Mon May 26 21:55:10 2025 +0800

    Set ccl and KMP param in simple launch (#3575)

    * Even 1 CPU mechine can also run multi process
      Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
    * fix ccl and kml param setting
      Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
    * set master addr only when processes > 1
      Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
    * fix num process check
      Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
    * fix ccl args check
      Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
    ---------
    Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

commit db536cbfeb61a92e642462a436b51104ab96bd2f
Author: Yuanzhou Cai <80858000+yuanjua@users.noreply.github.com>
Date:   Mon May 26 21:08:13 2025 +0800

    Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup (#3581)

    * Fix tracker initialize distributed before InitProcessGroupKwargs
    * Fix tracker initialize distributed before InitProcessGroupKwargs
    * Add test for bug #3550
    * Improve test for #3550
    * Remove redundant code
      Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
    * fix style
    ---------
    Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

commit 4e9d0deba6fd759f5f503f9b1587e79c51032a68
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Mon May 26 21:05:42 2025 +0800

    enable regional_compilation benchmark on xpu (#3592)

    * enable regional_compilation benchmark on xpu
      Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    * Apply style fixes
    ---------
    Signed-off-by: Matrix YAO <matrix.yao@intel.com>
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 8cb3ace89485af0488d93da6c080c36319cced9e
Author: Luiz F. G. dos Santos <luiz.fernando0992@gmail.com>
Date:   Thu May 22 10:21:54 2025 -0500

    Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` (#3540)

    * Added artifacts and figure tracking at MLFlow tracker
    * Added `log_artifact` to the MLFlowTracker
    * Remove changes
    * Added kwargs when loading state.
    * added doc string
    * Adjusted correct default types of kwargs
    * Changed the load kwargs to a single one
    * removed None value from kwargs
    * fix kwargs for loading the model
    * removed load_kwargs from optimizer state dict
    * make load_kwargs a dictionary
    * revert last changes
    * reverted load_kwargs
    * fix docstring
    * added dict initiation
    * Fix quality error during PR

commit b6d97cb856ae0c9daa310ab8305340950ea8763a
Author: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Date:   Thu May 22 17:26:31 2025 +0300

    Resolve logger warnings (#3582)

    Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

commit 33967d4733ec5bf402d85462ec2bbbcd8e872ea9
Author: Francesco Laiti <25352428+laitifranz@users.noreply.github.com>
Date:   Tue May 20 12:29:53 2025 +0200

    Add support for standalone mode when default port is occupied on single node (#3576)

    * add standalone mode and replace ConnectionError with a warning when the main process port is in use, allowing for automatic port selection
    * address review feedback: warn on port conflict only for single-node; raise error for multi-node
    * Apply style fixes
    ---------
    Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

commit 5b1fcda371b049f76e1bd8536e114635d9eaf5b3
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 20 18:04:24 2025 +0800

    enable test_cli & test_example cases on XPU (#3578)

    * enable test_cli & test_example cases on XPU
      Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    * fix style
      Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    * fix style
      Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    * remove print
      Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    * fix ci issue
      Signed-off-by: YAO Matrix <matrix.yao@intel.com>
    ---------
    Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    Signed-off-by: YAO Matrix <matrix.yao@intel.com>

commit f55f0533b5726d85a62fb05760ec6a92d00e0099
Author: Yao Matrix <matrix.yao@intel.com>
Date:   Tue May 20 18:02:14 2025 +0800

    goodbye torch_ccl (#3580)

    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

commit 1ec99f0b5842f2f246b6481248099920e74f6384
Author: Yao Matrix <yaoweifeng0301@126.com>
Date:   Mon May 19 17:27:40 2025 +0800

    enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU (#3579)

    * enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU
      Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    * fix style
      Signed-off-by: Matrix Yao <matrix.yao@intel.com>
    * Update test_load_checkpoint_and_dispatch_with_broadcast.py
    ---------
    Signed-off-by: Matrix Yao <matrix.yao@intel.com>

Author

S1ro1

Committer

S1ro1

Parents

f8bac5aa

accelerate 27edf352 - Squashed commit of the following:
