Remove needs_dynamic_casting from TensorIterator and move it to Loops.cuh (#32755)
Summary:
Remove `needs_dynamic_casting` from TensorIterator and move it to `Loops.cuh`.
The original design of `needs_dynamic_casting` is fundamentally flawed: it injects logic into TensorIterator and uses a bunch of boolean flags to test whether dynamic casting is needed. This makes it very fragile; TensorIterator is already complicated, and it is easy to introduce unnecessary dynamic casts. It also makes `gpu_kernel` inflexible, since different cases need to manipulate TensorIterator to make things work.
For example, currently
```python
torch.zeros(10, device='cuda').mul_(0.9)
```
performs a dynamic cast, but it shouldn't need one.
Testing whether dynamic casting is needed can be simple: just compare the dtypes of the lambda with the dtypes of the operands. If they don't match, dynamically cast; otherwise, don't.
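A minimal Python sketch of this comparison (the helper name `needs_dynamic_casting` mirrors the C++ one, but the signature here is hypothetical, for illustration only):

```python
import torch

def needs_dynamic_casting(lambda_dtype, operands):
    # Cast only when some operand's dtype differs from the dtype
    # the compute lambda was instantiated with; no flag bookkeeping.
    return any(t.dtype != lambda_dtype for t in operands)

x = torch.zeros(10)
# float32 lambda over float32 operands: no dynamic cast needed,
# matching the mul_(0.9) example above.
print(needs_dynamic_casting(torch.float32, [x, x]))
```

With this check, the decision depends only on the lambda's dtypes and the operands' dtypes, not on TensorIterator internals.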
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32755
Differential Revision: D19644092
Pulled By: ngimel
fbshipit-source-id: 130bb8bd78d20c2ed1bdfc9d9fb451eb0f0c7e55