Implement some missing element-wise Add/Sub/Mul/Div/Neg operations for CPU and CUDA EPs (#23090)
* [CPU EP] Implement Add/Sub/Mul/Div element-wise operations for (u)int8, (u)int16, uint32 and uint64.
* [CPU EP] Implement Neg unary operation for int16
* [CUDA EP] Implement Add/Sub/Mul/Div element-wise operations for (u)int8 and (u)int16.
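
For readers unfamiliar with the term, the following is a minimal, self-contained sketch of what element-wise Add/Sub/Mul/Div/Neg over small integer types means. It is not the kernel code from this change: `ElementWise`, `Add`, `Sub`, `Mul`, `Div` and `Neg` are hypothetical helpers written for illustration, and the sketch assumes equally sized inputs (no broadcasting) and non-zero divisors.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an ONNX Runtime API: applies `op` to each pair of
// elements taken from two equally sized inputs (no broadcasting).
template <typename T, typename Op>
std::vector<T> ElementWise(const std::vector<T>& a, const std::vector<T>& b, Op op) {
  std::vector<T> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    // 8- and 16-bit integers are promoted to int during arithmetic,
    // so the result is cast back to the element type T.
    out[i] = static_cast<T>(op(a[i], b[i]));
  }
  return out;
}

template <typename T>
std::vector<T> Add(const std::vector<T>& a, const std::vector<T>& b) {
  return ElementWise(a, b, [](T x, T y) { return x + y; });
}

template <typename T>
std::vector<T> Sub(const std::vector<T>& a, const std::vector<T>& b) {
  return ElementWise(a, b, [](T x, T y) { return x - y; });
}

template <typename T>
std::vector<T> Mul(const std::vector<T>& a, const std::vector<T>& b) {
  return ElementWise(a, b, [](T x, T y) { return x * y; });
}

// Assumption: the caller guarantees every divisor is non-zero.
template <typename T>
std::vector<T> Div(const std::vector<T>& a, const std::vector<T>& b) {
  return ElementWise(a, b, [](T x, T y) { return x / y; });
}

// Unary negation, e.g. for int16_t.
template <typename T>
std::vector<T> Neg(const std::vector<T>& a) {
  std::vector<T> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = static_cast<T>(-a[i]);
  }
  return out;
}
```

Instantiating the same template per element type, for example `Add<std::int8_t>` or `Div<std::uint16_t>`, mirrors the idea of enabling one operator for several integer types; how the actual CPU and CUDA kernels are registered is not shown here.
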
### Motivation and Context
This solves https://github.com/microsoft/onnxruntime/issues/23051