Apply GradientCheckpointingLayer to the whole repo (#38913)
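The commits below migrate each model's layer classes onto a shared `GradientCheckpointingLayer` base class, replacing the per-model `if self.gradient_checkpointing and self.training:` branch that was previously duplicated in every modeling file. A minimal, framework-free sketch of the pattern (names and structure are illustrative; the real transformers class delegates to `torch.utils.checkpoint`):

```python
def checkpoint(fn, *args, **kwargs):
    # Stand-in for torch.utils.checkpoint.checkpoint: run the forward pass
    # without saving intermediates, recomputing them during backward.
    return fn(*args, **kwargs)


class GradientCheckpointingLayer:
    """Base class whose __call__ reroutes through activation checkpointing
    when training with gradient checkpointing enabled (sketch only)."""

    gradient_checkpointing = False
    training = True

    def __call__(self, *args, **kwargs):
        if self.gradient_checkpointing and self.training:
            return checkpoint(self.forward, *args, **kwargs)
        return self.forward(*args, **kwargs)


class DecoderLayer(GradientCheckpointingLayer):
    # Models now only implement forward(); the checkpointing branch
    # lives once in the base class instead of in every model file.
    def forward(self, hidden_states):
        return hidden_states * 2


layer = DecoderLayer()
layer.gradient_checkpointing = True
out = layer(3)  # routed through checkpoint() because the flag is set
```

Subclassing is the whole migration: each model-specific layer listed below switches its base class and drops its local checkpointing branch.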
* first batch (4)
* align
* altclip
* beit
* bert
* yolos
* dino, pvt_v2
* bark, bart, bert_generation
* big_bird, biogpt
* blenderbot, bloom
* bridgetower
* camembert, canine, chameleon
* chinese_clip, clap, clip
* codegen, conditional_detr, convbert
* dab_detr, data2vec
* dbrx, deberta
* deberta, decision_transformer, deformable_detr
* deit, deta, mctct
* detr, dinov2, distilbert
* donut, dpt, electra
* ernie, esm, falcon
* flava, fnet, falcon_mamba
* focalnet, git, gpt2
* gpt_bigcode, gpt_neo, gpt_neox
* gptj, groupvit
* idefics2, idefics3
* ijepa, imagegpt, internvl
* jetmoe, kosmos2, layoutlm
* layoutlm2-3, led
* lilt, longformer, longt5, luke
* m2m, mamba1-2
* marian, markuplm, mask2former
* maskformer
* mbart, megatron_bert, mimi
* mixtral, mlcd
* mobilevit1-2, modernbert
* moshi, mpt, mra
* mt5, musicgen
* mvp, nemotron
* nllb_moe
* nystromformer, omdet_turbo
* opt, owlvit, owlv2
* pegasus, pegasus_x, persimmon
* phimoe, pix2struct, pixtral
* plbart, pop2piano, prophetnet
* qwen2*
* qwen2, qwen3_moe, recurrent_gemma
* rembert
* roberta
* roberta prelayernorm
* roc_bert, roformer, rwkv
* sam, sam_hq
* seggpt, smolvlm, speech_to_text
* splinter, stablelm, swin
* swin2sr, switch_transformers, t5, table_transformer
* tapas, time_series_transformer, timesformer
* trocr, tvp, umt5
* videomae, vilt, visual_bert
* vit, vit_mae, vit_msn
* vitpose_backbone, vits, vivit
* whisper, x_clip, xglm
* xlm_roberta, xmod
* yoso
* zamba
* vitdet, wav2vec2, wav2vec2_bert
* unispeech, wav2vec2_conformer
* wavlm
* speecht5
* swinv2
* sew / _d
* seamless_m4t / _v2
* deprecated models update
* bros
* gemma2, gemma3
* got, hiera, hubert, llama4, mllama, oneformer, phi, olmoe, informer
* fixup
* Add use_cache=False and past_key_value=None to GradientCheckpointingLayer
* fixup
* fix prophetnet
* fix bigbird_pegasus
* fix blenderbot
* fix mbart
* fix mvp
* fix zamba2
* fix bart
* fix blenderbot_small
* fix codegen
* Update gradient checkpointing layer to support more past_key_values arg names
* fix data2vec vision
* fix deformable_detr
* fix gptj
* fix led
* fix m2m_100
* add comment
* fix nllb_moe
* Fix pegasus_x
* fix plbart
* udop
* fix-copies: beit, wav2vec2
* fix gpt_bigcode
* fixup
* fix t5
* fix switch_transformers
* fix longt5
* fix mt5
* update tapas
* fix blip2
* update blip
* fix musicgen
* fix gpt2, trocr
* fix copies
* !!! Revert zamba, mllama
* update autoformer
* update bros
* update args / kwargs for BERT and copies
* 2nd round of updates
* update conditional detr
* Pass encoder_hidden_states as positional arg
* Update to pass encoder_decoder_position_bias as positional arg
* fixup
* biogpt modular
* modular gemma2
* modular gemma3
* modular gpt_neox
* modular informer
* modular internvl
* modular mixtral
* modular mlcd
* modular modernbert
* modular phi
* modular qwen2_5_omni
* modular qwen2_5_vl
* modular sam_hq
* modular sew
* wav2vec2_bert
* modular wav2vec2_conformer
* modular wavlm
* fixup
* Update via modular instructblipvideo
* modular data2vec_audio
* nit modular mistral
* apply modular minimax
* fix modular moonshine
* revert zamba2
* fix mask2former
* refactor idefics
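Several of the fix commits above concern cache handling: because checkpointing re-runs the forward pass during backward, a live KV cache would be read or written twice, so when checkpointing is active the layer forces `use_cache=False` and nulls out the cache argument under whichever name the model uses (`past_key_value`, `past_key_values`, ...). A hedged sketch of that kwarg normalization (the helper name and exact name list are assumptions, not the actual implementation):

```python
# Names a cache argument may go by across the models touched in this PR;
# this list is illustrative, not exhaustive.
CACHE_ARG_NAMES = ("past_key_value", "past_key_values", "layer_past")


def prepare_for_checkpointing(kwargs):
    """Force-disable caching and drop any cache object so the recomputed
    forward pass during backward cannot mutate or duplicate the cache.
    (Hypothetical helper, sketching the behavior described in the commits.)"""
    cleaned = dict(kwargs)
    if "use_cache" in cleaned:
        cleaned["use_cache"] = False
    for name in CACHE_ARG_NAMES:
        if name in cleaned:
            cleaned[name] = None
    return cleaned


kwargs = {"use_cache": True, "past_key_values": object(), "attention_mask": "m"}
cleaned = prepare_for_checkpointing(kwargs)
```

Non-cache arguments pass through untouched, which is why per-model fixes were still needed wherever a model spelled its cache argument differently or passed `encoder_hidden_states` by keyword instead of positionally.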