fix multi_output_kernel (#51827)

Commit

3 years ago

fix multi_output_kernel (#51827)

Summary:
With zasdfgbnm's help and with his small TensorIterator kernel repro https://github.com/zasdfgbnm/tensoriterator we've found a workaround for what looks like a compiler bug in multi_output_kernel that manifests itself with cuda 10.2 and cuda 11 when there is a non-trivial OffsetCalculator.
It looks like those nvcc versions cannot handle inheritance in device structs, so instead of inheriting `multi_outputs_unroll` from `unroll` we make it independent.

cc vkuzo, haichuan-fb I verified that reverting https://github.com/pytorch/pytorch/issues/49315 to bring back multi_output_kernel and running `test_learnable_backward_per_channel_cuda` test passes, but I didn't do it in this PR - can you take it up as a follow-up?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51827

Reviewed By: izdeby

Differential Revision: D26305559

Pulled By: ngimel

fbshipit-source-id: 1168e7c894d237a954abfd1998eaad54f0ce40a7
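For readers unfamiliar with the pattern, the workaround described in the summary (dropping inheritance in favor of an independent struct) can be sketched roughly as follows. This is a host-only C++ illustration, not the code from the PR: `StridedOffsets`, `UnrollBase`, `MultiOutputsInherited`, `MultiOutputsIndependent`, and `store_pair_demo` are hypothetical names, and the real `unroll` and `multi_outputs_unroll` policies in PyTorch have different members and run on the device.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a non-trivial offset calculator: maps a linear
// index to a strided element offset.
struct StridedOffsets {
  std::size_t stride;
  std::size_t get(std::size_t linear_idx) const { return linear_idx * stride; }
};

// Shape of the code before the change (illustrative only): shared state and
// the bounds check live in a base struct.
struct UnrollBase {
  std::size_t remaining;
  bool check_inbounds(std::size_t i) const { return i < remaining; }
};

// The multi-output policy reuses the base through inheritance.
template <std::size_t NumOutputs>
struct MultiOutputsInherited : UnrollBase {
  std::array<float*, NumOutputs> outputs;
  StridedOffsets calc;
};

// Shape after the change: an independent struct with no base class that
// carries its own copy of the members it needs.
template <std::size_t NumOutputs>
struct MultiOutputsIndependent {
  std::array<float*, NumOutputs> outputs;
  StridedOffsets calc;
  std::size_t remaining;

  bool check_inbounds(std::size_t i) const { return i < remaining; }

  // Writes one value per output at the offset computed for linear index i;
  // out-of-bounds indices are skipped.
  void store(const std::array<float, NumOutputs>& values, std::size_t i) const {
    if (!check_inbounds(i)) return;
    const std::size_t offset = calc.get(i);
    for (std::size_t k = 0; k < NumOutputs; ++k) outputs[k][offset] = values[k];
  }
};

// Small demo: two outputs, stride 2, two valid linear indices.
inline std::array<float, 2> store_pair_demo() {
  float a[4] = {0, 0, 0, 0};
  float b[4] = {0, 0, 0, 0};
  MultiOutputsIndependent<2> policy{{a, b}, StridedOffsets{2}, 2};
  policy.store({1.0f, 2.0f}, 1);  // linear index 1 maps to offset 2
  policy.store({9.0f, 9.0f}, 5);  // out of bounds, ignored
  return {a[2], b[2]};
}
```

The independent version duplicates a few members, which is the trade-off the summary accepts in order to avoid the inheritance that the affected nvcc versions appear to mishandle.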

Author

Natalia Gimelshein

Committer

facebook-github-bot

Parents

21dccbca

pytorch d9e67507 - fix multi_output_kernel (#51827)
