[TensorRT EP] Fix bug for DDS output handling for empty tensor (#19575)
When a DDS (data-dependent shape) output is an empty tensor (i.e. any of
its dimensions is 0), TRT EP now performs neither cudaMemcpyAsync() nor
cuda::Impl_Cast(), to avoid accidentally overwriting memory that might
belong to other tensors.
This PR also refactors the code to allocate only a single byte for any
empty tensor.
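The following is a minimal host-side sketch of the guard described above, not the actual TRT EP code. All names (IsEmptyShape, ElementCount, AllocationBytes, CopyDdsOutput) are hypothetical, and std::memcpy stands in for cudaMemcpyAsync() / cuda::Impl_Cast() so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: a tensor is empty if any of its dimensions is 0.
bool IsEmptyShape(const std::vector<int64_t>& shape) {
  for (int64_t dim : shape) {
    if (dim == 0) return true;
  }
  return false;
}

// Hypothetical helper: number of elements described by the shape.
std::size_t ElementCount(const std::vector<int64_t>& shape) {
  std::size_t count = 1;
  for (int64_t dim : shape) count *= static_cast<std::size_t>(dim);
  return count;
}

// Hypothetical helper: an empty tensor gets a one-byte placeholder
// allocation; any other tensor gets its full size in bytes.
std::size_t AllocationBytes(const std::vector<int64_t>& shape,
                            std::size_t element_size) {
  return IsEmptyShape(shape) ? 1 : ElementCount(shape) * element_size;
}

// Copies the output only when the tensor is non-empty. Returns true if a
// copy was performed. std::memcpy is a stand-in for the device copy or
// cast; that substitution is an assumption made for this sketch.
bool CopyDdsOutput(const void* src, void* dst,
                   const std::vector<int64_t>& shape,
                   std::size_t element_size) {
  if (IsEmptyShape(shape)) {
    return false;  // nothing to copy; leave dst untouched
  }
  assert(src != nullptr && dst != nullptr);
  std::memcpy(dst, src, ElementCount(shape) * element_size);
  return true;
}
```

In this sketch the emptiness check comes before any write, so a destination buffer for an empty tensor is never touched.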
#TODO: add unit tests to cover the DDS code paths, or do more testing
with concurrent, sequential, and threaded faster-rcnn runs using
onnx_test_runner and verify the outputs
---------
Co-authored-by: Chi Lo <lochi@microsoft.com>