Fix Deferred Release and Add New Test Framework for CUDA EP-specific Tests (#13016)
Since the CUDA EP became a shared library, most of its internal functions are
not accessible from `onnxruntime_test_all`, so we need a new mechanism for
writing CUDA EP-specific tests. To this end, this PR introduces general test
infrastructure and an example test for deferred release in the CUDA EP. While
adding this test, we also found that the current deferred release causes an
error when the pinned CPU buffer is not allocated by BFCArena, so this PR also
includes a small fix (see the changes in rocm_execution_provider.cc and
cuda_execution_provider.cc).
This PR also fixes a deferred release bug found by the new tests.
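
For readers unfamiliar with the term, the idea behind deferred release can be sketched roughly as follows. This is a generic illustration, not the code in this PR: `DeferredReleaseQueue`, `Defer`, `ReleaseAll`, and `PendingCount` are hypothetical names, and the sketch simply assumes each buffer is queued together with the deleter of whichever allocator produced it, so that the release step does not depend on a particular allocator type.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration only; none of these names come from ONNX Runtime.
class DeferredReleaseQueue {
 public:
  using Deleter = std::function<void(void*)>;

  // Queue a buffer that asynchronous work may still be reading, together
  // with the deleter of the allocator that produced it.
  void Defer(void* buffer, Deleter deleter) {
    pending_.push_back({buffer, std::move(deleter)});
  }

  // Call only after the asynchronous work using the buffers has finished.
  // Frees every queued buffer with its own deleter and returns the count.
  std::size_t ReleaseAll() {
    std::size_t released = 0;
    for (auto& entry : pending_) {
      entry.deleter(entry.buffer);
      ++released;
    }
    pending_.clear();
    return released;
  }

  // Number of buffers still waiting to be released.
  std::size_t PendingCount() const { return pending_.size(); }

 private:
  struct Entry {
    void* buffer;
    Deleter deleter;
  };
  std::vector<Entry> pending_;
};
```

In a real GPU setting, `ReleaseAll` would be triggered only once the stream work that touches the buffers is known to be complete; how that is detected is outside the scope of this sketch.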