pipe engine _aggregate_total_loss: more efficient loss concatenation (#4327)
* _aggregate_total_loss: more efficient loss concatenation
Optimize the _aggregate_total_loss function to remove the dependency
on copying from device to host and back to device.
This reduces the runtime on the host.
* Fix the if/else branch in which the optimization should take place
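
A minimal sketch of the idea behind the optimization. It is not the actual
engine code: the helper names concat_losses_via_host and
concat_losses_on_device are hypothetical, and it assumes both losses are
scalar tensors that already live on the same device.

```python
import torch


def concat_losses_via_host(dp_group_loss, agg_loss, device):
    # Hypothetical "before" pattern: .item() copies each scalar from the
    # device to the host, and .to(device) copies the new tensor back.
    return torch.tensor([dp_group_loss.item(), agg_loss.item()]).to(device)


def concat_losses_on_device(dp_group_loss, agg_loss):
    # Hypothetical "after" pattern: torch.stack combines the two scalar
    # tensors on the device they already live on, with no host round trip.
    return torch.stack([dp_group_loss, agg_loss]).float()


# Falls back to CPU when no GPU is available, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dp_group_loss = torch.tensor(0.25, device=device)
agg_loss = torch.tensor(1.5, device=device)

old = concat_losses_via_host(dp_group_loss, agg_loss, device)
new = concat_losses_on_device(dp_group_loss, agg_loss)

# Both approaches produce the same values; only the copy pattern differs.
assert torch.equal(old, new)
```

On a CPU-only machine both helpers behave the same; the saving only matters
when the losses live on an accelerator and the host would otherwise have to
wait for the device-to-host copy.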
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>