Fix cuda graph capture (#15005)
Fix two issues related to cuda graph capture:
https://github.com/microsoft/onnxruntime/issues/14942 and
https://github.com/microsoft/onnxruntime/issues/15002
Issue 1: Previously, graph capture started at the second run. However,
memory pattern optimization allocates memory starting from the second
run, and cudaMalloc is not allowed during graph capture. In this PR,
graph capture starts only after 2 regular runs to avoid the issue.
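
The run-count gate described above can be sketched as follows. This is a simplified model, not the actual ONNX Runtime code: `CaptureGate`, `run_modes`, and the mode strings are hypothetical names used only for illustration.

```python
class CaptureGate:
    """Hypothetical helper: counts regular runs before capture is allowed."""

    def __init__(self, min_runs_before_capture: int = 2) -> None:
        self.min_runs_before_capture = min_runs_before_capture
        self.regular_runs = 0

    def allowed(self) -> bool:
        # Capture is allowed once enough regular (uncaptured) runs have finished.
        return self.regular_runs >= self.min_runs_before_capture

    def record_regular_run(self) -> None:
        self.regular_runs += 1


def run_modes(num_runs: int, gate: CaptureGate) -> list:
    """Return what each run does: 'regular', 'capture', or 'replay'."""
    modes = []
    captured = False
    for _ in range(num_runs):
        if captured:
            modes.append("replay")
        elif gate.allowed():
            modes.append("capture")
            captured = True
        else:
            modes.append("regular")
            gate.record_regular_run()
    return modes
```

With a minimum of 2, the sketch captures on the third run, after the run in which memory pattern optimization allocates; with a minimum of 1, capture lands on the second run, which is the case that failed.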
Issue 2: https://github.com/microsoft/onnxruntime/pull/13495 introduced
multiple stream support. But stream cleanup calls
cudaStreamSynchronize, which is not allowed during cuda graph capture.
In this PR, we move stream cleanup to after cuda graph capture ends.
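
The ordering change can be illustrated with a small model that needs no GPU. `FakeStream` and `run_with_capture` are hypothetical stand-ins, not ONNX Runtime or CUDA APIs; the only behavior modeled is that synchronizing a stream while it is capturing is an error.

```python
class FakeStream:
    """Hypothetical stand-in for a CUDA stream; records calls in a log."""

    def __init__(self) -> None:
        self.capturing = False
        self.log = []

    def begin_capture(self) -> None:
        self.capturing = True
        self.log.append("begin_capture")

    def end_capture(self) -> None:
        self.capturing = False
        self.log.append("end_capture")

    def synchronize(self) -> None:
        # Models the restriction: a capturing stream must not be synchronized.
        if self.capturing:
            raise RuntimeError("synchronize during capture")
        self.log.append("synchronize")


def run_with_capture(stream: FakeStream, cleanup_after_capture: bool = True) -> list:
    """Capture one run; stream cleanup (a synchronize) happens before or after end_capture."""
    stream.begin_capture()
    try:
        stream.log.append("launch_kernels")
        if not cleanup_after_capture:
            stream.synchronize()  # old order: cleanup inside the capture region
    finally:
        stream.end_capture()
    if cleanup_after_capture:
        stream.synchronize()  # new order: cleanup once capture has ended
    return stream.log
```

In this model, the default order completes normally, while `cleanup_after_capture=False` raises, mirroring the failure this PR fixes.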
Update the SqueezeNet test model to use a dynamic batch axis so that we
can test with a larger batch size. Add a test that reproduces the bug
when min runs is changed from 2 back to 1.