fix division by low precision scalar (#41446)
Summary:
Previously, the inverse used for division by a scalar was computed in the precision of the non-scalar operand, which can lead to underflow:
```
>>> import torch
>>> x = torch.tensor([3388.]).half().to(0)
>>> scale = 524288.0
>>> x.div(scale)
tensor([0.], device='cuda:0', dtype=torch.float16)
>>> x.mul(1. / scale)
tensor([0.0065], device='cuda:0', dtype=torch.float16)
```
This PR makes division by a scalar produce the same result as multiplication by its inverse.
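The effect can be illustrated without a GPU. The sketch below is a pure-Python model, not the kernel code: `to_half`, `div_low_precision_inverse`, and `div_high_precision_inverse` are hypothetical helpers, and the assumption that the old path rounded the scalar to half before inverting it is ours, not something stated in this PR.

```python
import math
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("e", struct.pack("e", value))[0]
    except OverflowError:
        # struct rejects finite values outside the half range; model them as +/-inf.
        return math.copysign(math.inf, value)


def div_low_precision_inverse(x: float, scale: float) -> float:
    # Assumed model of the old behavior: the scalar is rounded to half first,
    # so a large scale becomes inf and its inverse collapses to zero.
    inverse = to_half(1.0 / to_half(scale))
    return to_half(to_half(x) * inverse)


def div_high_precision_inverse(x: float, scale: float) -> float:
    # Assumed model of the fixed behavior: invert in higher precision and
    # round only the final product to half.
    inverse = 1.0 / scale
    return to_half(to_half(x) * inverse)


x, scale = 3388.0, 524288.0
low = div_low_precision_inverse(x, scale)
high = div_high_precision_inverse(x, scale)
print(low)   # 0.0
print(high)  # nonzero, equal to x / scale rounded to half
```

In this model the high-precision path agrees with both `to_half(x / scale)` and multiplication by `1. / scale`, which is the equality the PR aims for.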
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41446
Reviewed By: ezyang
Differential Revision: D22542872
Pulled By: ngimel
fbshipit-source-id: b60e3244809573299c2c3030a006487a117606e9