Enable FSDP & DeepSpeed + FP8 (#2983)
* Working version rebased from main
* kwargs
* Clean
* Fix more nits
* Fin
* Delay autocast flag
* Enable FP8 autocast during eval only if specified
* Fin
* Remove comment
* All done
* ZeRO-3 works!
* Let the wrapper come off during unwrap_model
* Add import check
* Migrate all to benchmarks folder and make TE import check work
* Add readme
* Add README to benchmarks folder
* Update CLI to now include fp8 args
* Add test config for 0_34
* Finish adding to config yaml
* Write docs
* Expand docs with FP8
* Add to toctree