Add BF16 CUDA version for Gelu-20 ONNX op (#25765)
### Description
This PR adds support for running the [Gelu-20
op](https://onnx.ai/onnx/operators/onnx__Gelu.html#gelu-20) from the ONNX
standard with bfloat16 precision on CUDA.
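
For illustration, here is a minimal sketch of a model that exercises this kernel, built with the standard `onnx.helper` API. The tensor names, shapes, and file name are arbitrary placeholders, not taken from the PR:

```python
import onnx
from onnx import TensorProto, helper

# Single Gelu node on bfloat16 tensors, ONNX domain, opset 20
# (opset 20 is where Gelu and its bfloat16 type constraint are defined).
node = helper.make_node("Gelu", inputs=["X"], outputs=["Y"], approximate="none")
graph = helper.make_graph(
    [node],
    "gelu_bf16",
    [helper.make_tensor_value_info("X", TensorProto.BFLOAT16, [1, 8])],
    [helper.make_tensor_value_info("Y", TensorProto.BFLOAT16, [1, 8])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 20)])
onnx.checker.check_model(model)
onnx.save(model, "gelu_bf16.onnx")
```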
### Motivation and Context
Without this PR, a Gelu op from the ONNX domain at opset 20 gets decomposed
into a series of primitive ops when the model is generated. The following
error then occurs when loading a BF16 CUDA version of the Gemma-3 1B model
into an inference session:
```
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Pow(15) node with name ''
```
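
For reference, a rough verification sketch under the same assumptions as the model above (`gelu_bf16.onnx` is the placeholder file from that sketch; any BF16 model containing a Gelu-20 node would do). Loading the model, i.e. creating the inference session, is the step that previously failed:

```python
import onnxruntime as ort

# Before this change there was no BF16 CUDA kernel for Gelu-20, so the node
# could not be assigned to the CUDA EP; with the kernel registered, session
# creation resolves the Gelu node directly instead of requiring a decomposition.
sess = ort.InferenceSession(
    "gelu_bf16.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print([i.name for i in sess.get_inputs()])
```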