parallel_apply should forward current streams to worker threads (#78824)
#71033 moved test_data_parallel_module et al. under `instantiate_device_type_tests`. As a side effect, these tests now run on a non-default stream. `parallel_apply` creates one new thread per device but does not forward the parent thread's thread-local current streams, so each per-device thread defaults to the null stream. The null stream does not sync with non-default non-blocking streams, which results in errors when these tests assert that tensors are equal.
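
To illustrate the failure mode and the fix, here is a minimal pure-Python sketch of the pattern. It is not the code from this change: `threading.local` stands in for CUDA's thread-local current stream, and every name in it (`current_stream`, `set_stream`, `parallel_apply_sketch`, the stream labels) is hypothetical.

```python
import threading

# Stand-in for the thread-local "current stream" state (hypothetical).
_state = threading.local()
NULL_STREAM = "null-stream"


def current_stream(device):
    """Return the calling thread's current stream for `device`."""
    return getattr(_state, "streams", {}).get(device, NULL_STREAM)


def set_stream(device, stream):
    """Set the calling thread's current stream for `device`."""
    if not hasattr(_state, "streams"):
        _state.streams = {}
    _state.streams[device] = stream


def parallel_apply_sketch(devices, forward_streams):
    """Spawn one worker per device and report the stream each worker sees."""
    # Capture the parent's streams *before* spawning: thread-local state
    # is not inherited by new threads.
    parent_streams = {d: current_stream(d) for d in devices}
    results = {}
    lock = threading.Lock()

    def worker(device):
        if forward_streams:
            # The fix: re-establish the parent's stream inside the worker.
            set_stream(device, parent_streams[device])
        with lock:
            results[device] = current_stream(device)

    threads = [threading.Thread(target=worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# The parent thread runs on non-default streams, as the moved tests do.
set_stream(0, "side-stream-0")
set_stream(1, "side-stream-1")

without_fix = parallel_apply_sketch([0, 1], forward_streams=False)
with_fix = parallel_apply_sketch([0, 1], forward_streams=True)
```

Without forwarding, every worker falls back to the null stream; with forwarding, each worker sees the parent's stream for its device. In PyTorch terms, the analogous calls would be `torch.cuda.current_stream(device)` in the parent thread and the `torch.cuda.stream(...)` context manager in each worker, though that mapping is an assumption and is not spelled out in this description.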
CC @janeyx99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78824
Approved by: https://github.com/pruthvistony, https://github.com/janeyx99