Bilinear Upsampling increased throughput (#19306)
Summary:
Changed `UpsampleBilinearKernel` so that throughput increases by 40-50%.
I tested with my own local test code, **not PyTorch's provided test code**, because I am having a build problem (which I filed an issue about [here](https://github.com/pytorch/pytorch/issues/19184)). I tested with various tensor sizes, and across all of them it showed a significant increase in throughput.
1. Added `__restrict__`.
2. Instead of launching as many threads as there are output elements, I launch only `output_height * output_width` threads and have each thread iterate over the batch and channel dimensions.
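
The two changes above can be modeled on the CPU with the sketch below. It is not the actual `UpsampleBilinearKernel`: the function name `upsample_bilinear_model`, the contiguous NCHW layout, and the `align_corners=true` style scaling are all assumptions made for illustration, and the outer `tid` loop only stands in for the CUDA thread index. `__restrict__` here is the GCC/Clang spelling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical CPU model of the new thread mapping (not the real kernel).
// One "thread" per output pixel; each thread loops over batch and channel.
// Assumes contiguous NCHW tensors and align_corners=true style scaling.
void upsample_bilinear_model(const float* __restrict__ in,
                             float* __restrict__ out,
                             int batch, int channels,
                             int in_h, int in_w,
                             int out_h, int out_w) {
    assert(batch > 0 && channels > 0);
    assert(in_h > 0 && in_w > 0 && out_h > 0 && out_w > 0);

    const float scale_h = out_h > 1 ? float(in_h - 1) / float(out_h - 1) : 0.0f;
    const float scale_w = out_w > 1 ? float(in_w - 1) / float(out_w - 1) : 0.0f;

    // Only output_height * output_width "threads" are launched.
    const int num_threads = out_h * out_w;
    for (int tid = 0; tid < num_threads; ++tid) {
        const int oy = tid / out_w;
        const int ox = tid % out_w;

        // Source coordinates and weights are computed once per output pixel.
        const float fy = scale_h * oy;
        const float fx = scale_w * ox;
        const int y0 = static_cast<int>(fy);
        const int x0 = static_cast<int>(fx);
        const int y1 = std::min(y0 + 1, in_h - 1);
        const int x1 = std::min(x0 + 1, in_w - 1);
        const float wy = fy - y0;
        const float wx = fx - x0;

        // The same coordinates and weights are reused for every (n, c) plane.
        for (int n = 0; n < batch; ++n) {
            for (int c = 0; c < channels; ++c) {
                const std::size_t plane = std::size_t(n) * channels + c;
                const float* src = in + plane * in_h * in_w;
                float* dst = out + plane * out_h * out_w;

                const float top = (1.0f - wx) * src[y0 * in_w + x0]
                                + wx * src[y0 * in_w + x1];
                const float bot = (1.0f - wx) * src[y1 * in_w + x0]
                                + wx * src[y1 * in_w + x1];
                dst[oy * out_w + ox] = (1.0f - wy) * top + wy * bot;
            }
        }
    }
}
```

In this model the interpolation indices and weights are computed once per output pixel and shared across all `batch * channels` planes, which is one plausible source of the measured speedup; the actual cause was not profiled here.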
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19306
Differential Revision: D15701840
Pulled By: ezyang
fbshipit-source-id: 53c54d4f4e4a28b58ecc7d7ae6b864cbfc760e27