Shape-independent gradient builder for ops requiring broadcast (#4586)
* Adding CPU implementation of BroadcastGradientArgs op
Modify to take shape as input instead of tensor
Cleanup
Correct schema
Corrected kernel, added tests, addressed review comments.
Initial change, to add ReduceSumTraining cpu op
cpu support
Initial changes to gradient builder
Non-empty reduction case passing.
Added exception and test for invalid broadcast, addressed review comments.
cuda support + more UTs
on comments + UT
No-op support for empty ({}) axes via new attribute noop_with_empty_axes
Add noop attribute to ReduceSumTraining use
Add testing for no-shape graph, modify AddSub grad builder, logging.
MulGrad support
Div support
Expand support
Gemm support
MatMul grad change
Transpose Grad change
BiasGeluGrad change.
Fixes after squash
* Remove logging, add specific exception for shape inference error
* fix build
* Review comments
* Review comments
* Fix Windows build
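
For reviewers, a rough Python sketch of what a BroadcastGradientArgs-style computation is expected to produce: given two input shapes, it returns the axes of the broadcast output over which each input's gradient has to be summed. This is an illustration only, not the kernel added in this change. The function name, the ascending axis order, and the rule that an axis is reported only when the two dims differ are assumptions.

```python
def broadcast_gradient_args(a_shape, b_shape):
    """Return (a_axes, b_axes): output axes to reduce for each input's gradient.

    Hypothetical helper for illustration; not the actual op implementation.
    """
    rank = max(len(a_shape), len(b_shape))
    a_pad = rank - len(a_shape)  # leading axes missing from a
    b_pad = rank - len(b_shape)  # leading axes missing from b
    a_axes, b_axes = [], []
    for axis in range(rank):
        da = a_shape[axis - a_pad] if axis >= a_pad else None
        db = b_shape[axis - b_pad] if axis >= b_pad else None
        if da is None:
            # a has no such axis, so its gradient is summed over it.
            a_axes.append(axis)
        elif db is None:
            b_axes.append(axis)
        elif da != db:
            if da == 1:
                a_axes.append(axis)
            elif db == 1:
                b_axes.append(axis)
            else:
                # Assumed to mirror the "invalid broadcast" exception.
                raise ValueError(
                    f"cannot broadcast {tuple(a_shape)} with {tuple(b_shape)}"
                )
    return a_axes, b_axes


print(broadcast_gradient_args((2, 3, 4), (4,)))
print(broadcast_gradient_args((2, 3), (2, 3)))
```

When both shapes match, both axis lists are empty. A plain ONNX ReduceSum treats empty axes as "reduce over all dimensions", which is presumably why the noop_with_empty_axes attribute is needed: with it set, an empty axes input leaves the gradient unchanged.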
Co-authored-by: Ethan Tao <ettao@microsoft.com>