Fix multi_output_kernel (#51827)
Summary:
With zasdfgbnm's help and his small TensorIterator kernel repro (https://github.com/zasdfgbnm/tensoriterator), we found a workaround for what looks like a compiler bug in multi_output_kernel. The bug manifests with CUDA 10.2 and CUDA 11 when the OffsetCalculator is non-trivial.
It looks like those nvcc versions cannot handle inheritance in device structs, so instead of having `multi_outputs_unroll` inherit from `unroll`, we make it an independent struct.
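To illustrate the shape of the change, here is a minimal host-only sketch, not the actual PyTorch code. The names `UnrollPolicy`, `MultiOutputsUnrollPolicy`, `kWorkPerThread`, and `split_demo` are hypothetical stand-ins, and `__device__` annotations and the OffsetCalculator are omitted. The point is only that the multi-output policy repeats the shared members instead of deriving from the single-output one.

```cpp
#include <array>
#include <cassert>

// Hypothetical unroll factor; the real kernel uses its own constant.
constexpr int kWorkPerThread = 4;

// Simplified single-output policy (stand-in for `unroll`).
struct UnrollPolicy {
  const float* in;
  float* out;
  int remaining;  // number of valid elements in this tail block

  bool check_inbounds(int i) const { return i < remaining; }

  void load(std::array<float, kWorkPerThread>& vals) const {
    for (int i = 0; i < kWorkPerThread; ++i)
      if (check_inbounds(i)) vals[i] = in[i];
  }

  void store(const std::array<float, kWorkPerThread>& vals) const {
    for (int i = 0; i < kWorkPerThread; ++i)
      if (check_inbounds(i)) out[i] = vals[i];
  }
};

// Simplified multi-output policy (stand-in for `multi_outputs_unroll`).
// It deliberately does NOT inherit from UnrollPolicy: check_inbounds and
// load are duplicated so the struct is fully independent.
struct MultiOutputsUnrollPolicy {
  const float* in;
  std::array<float*, 2> outs;
  int remaining;

  bool check_inbounds(int i) const { return i < remaining; }

  void load(std::array<float, kWorkPerThread>& vals) const {
    for (int i = 0; i < kWorkPerThread; ++i)
      if (check_inbounds(i)) vals[i] = in[i];
  }

  // Each element produces two results, written to two output buffers.
  void store(const std::array<std::array<float, 2>, kWorkPerThread>& vals) const {
    for (int i = 0; i < kWorkPerThread; ++i) {
      if (!check_inbounds(i)) continue;
      outs[0][i] = vals[i][0];
      outs[1][i] = vals[i][1];
    }
  }
};

// Demo: maps x to (x + 1, x * 2) for up to kWorkPerThread elements.
inline void split_demo(const float* in, float* out0, float* out1, int n) {
  assert(n >= 0 && n <= kWorkPerThread);
  MultiOutputsUnrollPolicy policy{in, {out0, out1}, n};
  std::array<float, kWorkPerThread> loaded{};
  policy.load(loaded);
  std::array<std::array<float, 2>, kWorkPerThread> results{};
  for (int i = 0; i < kWorkPerThread; ++i)
    results[i] = {loaded[i] + 1.0f, loaded[i] * 2.0f};
  policy.store(results);
}
```

The duplication costs a few repeated lines, but it keeps the struct free of a base class, which is the property the workaround relies on.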
cc vkuzo, haichuan-fb: I verified that `test_learnable_backward_per_channel_cuda` passes after reverting https://github.com/pytorch/pytorch/issues/49315 to bring back multi_output_kernel, but I did not include that revert in this PR. Can you take it up as a follow-up?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51827
Reviewed By: izdeby
Differential Revision: D26305559
Pulled By: ngimel
fbshipit-source-id: 1168e7c894d237a954abfd1998eaad54f0ce40a7
Author: Natalia Gimelshein