[CUDA/ROCm] Remove limitation of BiasAdd (#17848)
Previously, BiasAdd only supports hidden dimensions of 32, 640 and 1280
for stable diffusion. This adds a kernel that could support any number
of channels.
### Motivation and Context
Stable Diffusion XL refiner model uses hidden dimensions of 768 or 1536,
which was not supported in BiasAdd.