[feat] implement `record_stream` when using CUDA streams during group offloading (#11081)
* implement `record_stream` for better performance.
* fix
* style.
* merge #11097
* Update src/diffusers/hooks/group_offloading.py
Co-authored-by: Aryan <aryan@huggingface.co>
* fixes
* docstring.
* remaining todos in `low_cpu_mem_usage`
* tests
* updates to docs.
---------
Co-authored-by: Aryan <aryan@huggingface.co>