[quant][core][gpu][feature] Implement quantized CUDA gelu
Summary:
Support for quantized CUDA gelu is provided via the composition
`dequantize -> fp32 cuda gelu kernel -> quantize`. A native int8 gelu
would not be mathematically equivalent to this composition, so we have
opted for this approach for now. It may be possible to write a variant
of int8 gelu that is equivalent to
`dequantize -> fp32 cuda gelu kernel -> quantize`; that is left as a
topic for future work.
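As a rough sketch of what the new op computes (this mirrors the
composition, not the kernel code itself; the shapes and quantization
parameters below are illustrative, and it assumes per-tensor quint8
quantization is available on CUDA):
```
import torch

# Illustrative per-tensor quantization parameters (not taken from the PR).
scale, zero_point = 0.1, 128

x = torch.randn(4, 8, device="cuda")
qx = torch.quantize_per_tensor(x, scale, zero_point, dtype=torch.quint8)

# The composition the new kernel implements:
# dequantize -> fp32 cuda gelu kernel -> quantize
# (assuming the output reuses the input's quantization parameters).
y = torch.quantize_per_tensor(
    torch.nn.functional.gelu(qx.dequantize()),
    scale, zero_point, dtype=torch.quint8,
)
```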
The test function `test_qgelu` was extended to exercise gelu on the
quantized CUDA backend, as sketched below.
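A minimal sketch of the kind of round-trip check the amended test
performs (the real `test_qgelu` parametrizes shapes, dtypes, and
quantization parameters; the values here are illustrative):
```
import torch

# Illustrative values; the real test covers many shapes/dtypes/qparams.
x = torch.randn(4, 8, device="cuda")
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=128, dtype=torch.quint8)

# gelu on a quantized CUDA tensor dispatches to the new kernel...
qy = torch.nn.functional.gelu(qx)
# ...and should match the explicit dequantize -> gelu -> quantize composition.
y_ref = torch.quantize_per_tensor(
    torch.nn.functional.gelu(qx.dequantize()),
    scale=qx.q_scale(), zero_point=qx.q_zero_point(), dtype=torch.quint8,
)
torch.testing.assert_close(qy.dequantize(), y_ref.dequantize())
```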
Test Plan:
```
python test/test_quantization.py -k test_qgelu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77212
Approved by: https://github.com/jerryzh168