Add per-WrapperLinear activation checkpointing for memory reduction
Passing `--enable_activation_checkpointing` reduces peak GPU memory
during tuning by wrapping each WrapperLinear forward in
torch.utils.checkpoint: during backward, only one layer's QDQ
intermediates are recomputed at a time instead of all layers'
intermediates being held in memory simultaneously.
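A minimal sketch of the dispatch described above, with simplified,
hypothetical names (CheckpointedLinear stands in for WrapperLinear, and a
plain nn.Linear stands in for the QDQ forward path):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedLinear(nn.Module):
    # Hypothetical stand-in for WrapperLinear; names are illustrative only.
    def __init__(self, in_features, out_features, enable_activation_checkpointing=False):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.enable_activation_checkpointing = enable_activation_checkpointing

    def _forward_impl(self, x):
        # In the real wrapper the QDQ (quantize-dequantize) computation
        # happens here; a plain linear stands in for it in this sketch.
        return self.linear(x)

    def forward(self, x):
        if self.enable_activation_checkpointing and torch.is_grad_enabled():
            # Do not store this layer's intermediates; recompute them
            # on demand during backward instead.
            return checkpoint(self._forward_impl, x, use_reentrant=False)
        return self._forward_impl(x)
```

The non-reentrant path (use_reentrant=False) is used so checkpointing
composes with inputs that do not require grad and with no-grad inference,
where the dispatch falls through to the plain forward.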
On Qwen3-30B-A3B (128-expert MoE, MXFP8, 10 iters) this cuts peak
VRAM from ~80 GB to ~13 GB (85% reduction) with ~3.5% time overhead
and identical quantization quality.
Key changes:
- wrapper.py: WrapperLinear gains enable_activation_checkpointing;
forward() dispatches to _checkpointed_forward -> _forward_impl
- compressors/base.py: passes flag through wrapper_block() call
- compressors/config.py: add to ExtraConfig + TuningExtraConfig
- autoround.py: add to AutoRound.__new__() signature
- __main__.py: add --enable_activation_checkpointing CLI flag
- compressors/utils.py: block_forward_with_activation_checkpointing
helper (kept for optional manual use)
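The optional block-level helper can be sketched as below; this is a
hypothetical reconstruction of block_forward_with_activation_checkpointing,
not the actual utils.py code, and the real signature may differ:

```python
import torch
from torch.utils.checkpoint import checkpoint


def block_forward_with_activation_checkpointing(block, *args, **kwargs):
    # Hypothetical sketch: run an entire block's forward under activation
    # checkpointing so its intermediates are recomputed during backward.
    # use_reentrant=False lets keyword arguments pass through to the block.
    return checkpoint(block, *args, use_reentrant=False, **kwargs)


# Manual use on an arbitrary module:
blk = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
x = torch.randn(2, 4, requires_grad=True)
out = block_forward_with_activation_checkpointing(blk, x)
out.sum().backward()
```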
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>