Add device tensor documentation for GPU execution providers (#20837)
This change adds documentation on:
- how to allocate CUDA device tensors from C++ and Python
- how to use DML device tensors from C++ and Python
- how to leverage existing GPU allocations in ORT
- how to overlap PCI copies and GPU execution using CUDA streams
- how to overlap PCI copies and GPU execution using D3D12 command lists
  and custom resources
---------
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>