Make the reduction ops more consistent in checking if no transpose is required and skipping the copy of the input data if that is the case. Significantly better performance when this is done (2x faster for model calling ReduceSumSquare with input of {2048,10}). (#3265)