Use CUDA callback to release deferred-release buffers (#12883)
* Use CUDA callback to release deferred-release buffers
Polish.
* Minor improvements.
1. Reorder an if-else so that frequent cases are checked first.
2. Add more documentation.
* Fix tests.
Previously, CUDAExecutionProvider::OnRunStart called
GetPerThreadContext in
auto& current_deferred_release_event = GetPerThreadContext().GetCurrentDeferredReleaseEvent();
so a CUDAExecutionProvider always owned an active PerThreadContext
and the ReleasePerThreadContext call in CUDAExecutionProvider::OnRunEnd
was always valid. This no longer holds now that the event-based
deferred-release code has been dropped, so we need to check whether
CUDAExecutionProvider really owns a PerThreadContext and call
ReleasePerThreadContext only if it does.
* Follow up for AMD GPU and improve CUDA part's return value.
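
The ownership check described under "Fix tests" can be sketched roughly as
follows. This is a simplified, CUDA-free illustration, not the actual
provider code: ProviderSketch, PerThreadContextSketch, and
OwnsPerThreadContext are hypothetical names, and the map keyed by thread id
is an assumption made only to keep the example self-contained.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical stand-in for the real per-thread state.
struct PerThreadContextSketch {
  int runs = 0;
};

// Hypothetical, simplified provider; not the real CUDAExecutionProvider.
class ProviderSketch {
 public:
  // Lazily creates the calling thread's context on first access.
  PerThreadContextSketch& GetPerThreadContext() {
    std::lock_guard<std::mutex> lock(mutex_);
    return contexts_[std::this_thread::get_id()];
  }

  // True if the calling thread currently has a context in this provider.
  bool OwnsPerThreadContext() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return contexts_.count(std::this_thread::get_id()) != 0;
  }

  // Only valid when the calling thread owns a context.
  void ReleasePerThreadContext() {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = contexts_.find(std::this_thread::get_id());
    assert(it != contexts_.end());
    contexts_.erase(it);
  }

  // Releases the context only if one exists; returns whether it did.
  bool OnRunEnd() {
    if (!OwnsPerThreadContext()) {
      return false;  // Nothing was created during this run.
    }
    ReleasePerThreadContext();
    return true;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::thread::id, PerThreadContextSketch> contexts_;
};
```

In this sketch, OnRunEnd is a no-op when the run never touched
GetPerThreadContext, which is the situation that arises once OnRunStart no
longer creates the context as a side effect.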