Enable BF16 autocast for everything during FP8 + some tweaks to enable FSDP (#2655)
* Basic autocasting stuff
* Delay FP8 autocast until after DDP wrapping
* More fixes
* Bookmark: without dtype change
* Bookmark: with dtype changes
* Different alternative, better results
* Didn't matter what order, same result
* Revert + maintain
* Fin
* Refactor based on feedback
* native_amp bool
* Final nits