[core] Refactor the Cache logic to make it simpler and more general (#39797)
* Simplify the logic quite a bit
* Update cache_utils.py
* continue work
* continue simplifying a lot
* style
* Update cache_utils.py
* offloading much simpler
* style
* Update cache_utils.py
* update inits
* Update cache_utils.py
* consistency
* Update cache_utils.py
* update generate
* style
* fix
* fix
* add early_initialization
* fix
* fix mamba caches
* update
* fix
* fix
* fix
* fix tests
* fix configs
* revert
* fix tests
* alright
* Update modeling_gptj.py
* fix the constructors
* cache tests
* Update test_cache_utils.py
* fix
* simplify
* back to before -> avoid compile bug
* doc
* mistral test
* llama4 test dtype
* Update test_modeling_llama4.py
* CIs
* Finally find a nice impl
* Update cache_utils.py
* Update cache_utils.py
* add lazy methods in autodoc
* typo
* better doc
* Add detailed docstring for lazy init
* CIs
* style
* fix