Make dynamic casting case also benefit from unrolling (#34749)
Summary:
This is based on https://github.com/pytorch/pytorch/issues/34708. I didn't use a stacked diff because it is not very convenient for cherry-picking. Please review after https://github.com/pytorch/pytorch/issues/34708 is merged.
**Legacy kernels are now completely gone, and the rewrite of GPU loops is done.**
Benchmarks show big performance improvements on an RTX 2080 Ti:
https://github.com/zasdfgbnm/things/blob/master/2020Q1/benchmark-unroll-with-dyn-casting.ipynb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34749
Differential Revision: D22139597
Pulled By: ngimel
fbshipit-source-id: 5995744c339afee331f15ea2e483c6acf3ce0c62