Add internal ReduceSum op that accepts axes as an input (#4522)
* Initial change to add the ReduceSumTraining CPU op
* CPU support
* CUDA support and more unit tests
* Address review comments and add a unit test
* Support no-op for empty ({}) axes via the new attribute noop_with_empty_axes
* Address review comments
* Fix build
* Address review comments
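
For reference, the intended behavior can be sketched in NumPy. This is only an illustration, not the kernel code from this change: the function name reduce_sum_with_axes_input and the keepdims default are assumptions, and the empty-axes semantics are assumed to match ONNX ReduceSum (opset 13), where empty axes reduce over all dimensions unless noop_with_empty_axes is set.

```python
import numpy as np

def reduce_sum_with_axes_input(data, axes=None, keepdims=True,
                               noop_with_empty_axes=False):
    """Hypothetical reference for a ReduceSum that takes axes as an input.

    Assumed semantics (not taken from the kernel source):
      - non-empty axes: sum over exactly those axes
      - empty axes, noop_with_empty_axes=False: sum over all axes
      - empty axes, noop_with_empty_axes=True: return the input unchanged
    """
    data = np.asarray(data)
    # axes arrives as a tensor-like input rather than a static attribute.
    axes = () if axes is None else tuple(int(a) for a in np.asarray(axes).ravel())
    if len(axes) == 0:
        if noop_with_empty_axes:
            return data.copy()  # no reduction at all
        return np.sum(data, axis=None, keepdims=keepdims)
    return np.sum(data, axis=axes, keepdims=keepdims)
```

With a 2x3 input, axes=[1] yields shape (2, 1), axes=[] yields shape (1, 1), and axes=[] with noop_with_empty_axes=True returns the input as-is.
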
Co-authored-by: aishwarya bhandare <aibhanda@microsoft.com>
Co-authored-by: Ethan Tao <ettao@microsoft.com>