add dequantize support for fp16 + cuda (#67234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67234
Extends the fp16 dequantize function to also work on CUDA tensors,
and adds a corresponding test.
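
A minimal sketch of the behavior this enables (assumes a CUDA device is available; the exact test names are listed in the Test Plan below): calling `.dequantize()` on a float16 CUDA tensor returns a float32 copy, matching the existing CPU path.
```
import torch

# float16 tensor on CUDA (previously only the CPU path was supported)
x_fp16 = torch.randn(2, 3, dtype=torch.float16, device="cuda")

# dequantize() on an fp16 tensor returns a float32 tensor
x_fp32 = x_fp16.dequantize()

assert x_fp32.dtype == torch.float32
assert torch.allclose(x_fp32, x_fp16.to(torch.float32))
```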
Test Plan:
```
python test/test_quantization.py TestQuantizedTensor.test_dequantize_fp16_cuda
python test/test_quantization.py TestQuantizedTensor.test_dequantize_fp16_cpu
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D31915330
fbshipit-source-id: 622d47464fae26bf02f295ff56df63a3bf80b786