Add CUDA Scan operator. (#2403)
* Add Scan CUDA op.
Uses CPU implementation for logic.
Added some device specific functors for handling when data needs to be manipulated on a different device.
Added ability to override the materialization logic in the OrtValue slicer so DML can plugin their handling.