Improve checkpointing for ZeRO stage 1 (#5478)
* Initial running changes
* Checkpointing aggregation changes
* Compare with older version
* Initial cleanup
* Add ZeRO test, minor fix
* Fix ZeRO test, transform, formatting
* Review comments
* Add more unit tests
* Review comments
* Try to fix CI
* Add additional check on just the aggregation code
* Try to fix ckpt gen
* Add pregenerated ckpt for CI, enable ZeRO test in e2e
* Moving test to nightly, removing ckpt files
* Add tests to dist GPU CI
* Fix dist test
* Review comments
* Fix test