Fix NonZero CUDA wrappers to check launch errors via cudaGetLastError()

NonZeroCountEachBlock and NonZeroOutputPositions unconditionally returned
cudaSuccess after launching their CUDA kernels. This swallowed any launch
error and left it pending in the CUDA runtime's error state, where subsequent
CUB DeviceScan calls picked it up and surfaced it as a confusing
cudaErrorInvalidDevice (101) error.

Replace return cudaSuccess with return cudaGetLastError() to properly
detect and propagate kernel launch failures, matching the pattern used
by other CUDA kernel wrappers in the codebase.
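
The difference between the two return patterns can be sketched as follows.
This is a minimal illustration, not the actual NonZero code: the kernel
CountNonZeroSketch and the wrappers LaunchUnchecked and LaunchChecked are
hypothetical names, and the launch failure is forced by requesting 4096
threads per block, which is assumed to exceed the device limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real NonZero kernels are not reproduced here.
__global__ void CountNonZeroSketch(const int* x, int n, int* count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && x[i] != 0) atomicAdd(count, 1);
}

// Old pattern: the launch status is never queried, so a failed launch
// stays pending in the runtime's error state.
cudaError_t LaunchUnchecked(const int* x, int n, int* count, int threads) {
  int blocks = (n + threads - 1) / threads;
  CountNonZeroSketch<<<blocks, threads>>>(x, n, count);
  return cudaSuccess;
}

// New pattern: report the launch status to the caller.
cudaError_t LaunchChecked(const int* x, int n, int* count, int threads) {
  int blocks = (n + threads - 1) / threads;
  CountNonZeroSketch<<<blocks, threads>>>(x, n, count);
  return cudaGetLastError();
}

int main() {
  const int n = 8;
  int* x = nullptr;
  int* count = nullptr;
  if (cudaMalloc(&x, n * sizeof(int)) != cudaSuccess ||
      cudaMalloc(&count, sizeof(int)) != cudaSuccess) {
    std::printf("no usable CUDA device\n");
    return 1;
  }
  cudaMemset(x, 0, n * sizeof(int));
  cudaMemset(count, 0, sizeof(int));

  // Assumption: 4096 threads per block is above the device maximum,
  // so both launches below are rejected by the runtime.
  const int bad_threads = 4096;

  cudaError_t unchecked = LaunchUnchecked(x, n, count, bad_threads);
  // The swallowed error is still pending; this call reads and clears it.
  cudaError_t pending = cudaGetLastError();
  cudaError_t checked = LaunchChecked(x, n, count, bad_threads);

  std::printf("unchecked wrapper returned: %s\n", cudaGetErrorString(unchecked));
  std::printf("error left pending:         %s\n", cudaGetErrorString(pending));
  std::printf("checked wrapper returned:   %s\n", cudaGetErrorString(checked));

  cudaFree(x);
  cudaFree(count);
  // Exit code 0 only if the unchecked wrapper hid an error that the
  // checked wrapper reported.
  return (unchecked == cudaSuccess && pending != cudaSuccess &&
          checked != cudaSuccess) ? 0 : 1;
}
```

In the unchecked variant the error is only seen by whichever later call
happens to query the runtime's error state, which is how it can surface far
from the launch that caused it.
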
Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/1c0b0b59-00b3-481b-af23-4aa8989035fd
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>