49000047 - Very WIP: Architecture for robust cancellation
This commit is a first sketch of what I would like to do for robust cancellation (i.e. "making ^C just work"). At this point it is more of a sketch than a real PR, but I think I've done enough of the design for a design discussion.

The first thing I should say is that the goal of this PR is, very narrowly, to make ^C work well. As part of that, we take a bit of a step towards structured concurrency, but I am not intending this PR to be a full implementation of that. Given that some of this has been beaten to death in previous issues, I will also skip my usual motivation overview and jump straight into the implementation. As I said, the motivation at this point is simply to make ^C work reliably.

Broadly, when we try to cancel a task, it will be in one of two categories:

1. Waiting for some other operation to complete (e.g. an IO operation, another task, an external event, etc.). Here the actual cancellation itself is not so difficult (after all, the task is not running, but suspended in a somewhat well-defined place). However, robust cancellation requires us to potentially propagate the cancellation signal down the wait tree, since the operation we actually want to cancel may not be the root task, but may instead be some operation performed by the task we're waiting on (and we'd prefer not to leak those operations and have rogue tasks going around performing potentially side-effecting work).

2. Currently running and doing some computation. Here the core problem is not really one of propagation (after all, the long-running computation is probably exactly what we want to cancel), but rather how to do the cancellation without state corruption. A lot of the crashiness of our existing ^C implementation comes from simply injecting an exception in places that are not expecting to handle it.

A full solution needs an answer for both of these points. I will begin with the second, since the first builds upon it.

This PR introduces the concepts of a `cancellation request` and a `cancellation point`. Each task has a `cancellation_request` field that can be set externally (e.g. by ^C). Any task performing computation should regularly check this field and abort its computation if a cancellation request is pending. For this purpose, the PR provides the `@cancel_check` macro, which turns a pending cancellation request into a well-modeled exception. Package authors should insert a call to this macro into any long-running loops. However, the check of course has some overhead, and it is therefore inappropriate for tight inner loops. We attempt to address this with compiler support. Note that this part is currently incompletely implemented, so the following describes the design rather than the current state of the PR. Consider the `@cancel_check` macro:

```
macro cancel_check()
    quote
        local req = Core.cancellation_point!()
        if req !== nothing
            throw(conform_cancellation_request(req))
        end
    end
end
```

where `cancellation_point!` is a new intrinsic that defines a cancellation point. The compiler is semantically permitted to extend the cancellation point across any following `effect_free` calls (for transitivity reasons, the effect is not exactly `effect_free`, but it is morally equivalent).
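To illustrate what this extension buys, here is a minimal sketch of the intended semantics, not code from this PR; `collatz_step` is a stand-in effect-free helper defined here for the example. Because discarding effect-free work is unobservable, a single cancellation point per outer iteration can cover an entire effect-free inner loop:

```
# Minimal sketch of the intended compiler semantics, assuming this PR's
# `@Base.cancel_check`. `collatz_step` is a stand-in effect-free computation
# defined here for illustration only.
collatz_step(j) = iseven(j) ? j ÷ 2 : 3j + 1

function search_upto(n)
    for i in 1:n
        # One explicit cancellation point per outer iteration. The compiler
        # may extend it across the effect-free inner loop below: if a
        # cancellation request arrives mid-iteration, execution can be reset
        # to this point, and the discarded effect-free work is unobservable.
        @Base.cancel_check
        j = i
        while j != 1
            j = collatz_step(j)  # tight inner loop, no explicit check needed
        end
    end
end
```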
Upon passing a `cancellation_point!`, the system will set the current task's `reset_ctx` to this cancellation point. If a cancellation request occurs before the `reset_ctx` is cleared, the task's execution will be reset to the nearest cancellation point. I proposed this mechanism in #52291. In principle, the `reset_ctx` can additionally be used to establish scoped cancellation handlers for external C libraries, although I suspect that not many C libraries are actually reset-safe in the required manner (since allocation is not). Note that `cancellation_point!` is also intended to be a yield point in order to facilitate the ^C mechanism described below. However, this is not currently implemented.

Turning our attention now to the first of the two cases mentioned above, we tweak the task's existing `queue` reference to become a generic (atomic) "waitee" reference. The actual queue is required to be obtainable from this object via the new `waitqueue` generic function. To cancel a `waiter` waiting on a waitable `waitee` object, we (see the sketch at the end of this section):

1. Set the waiter's cancellation request.
2. Load the `waitee` and call a new generic function `cancel_wait!`, which does whatever synchronization and internal bookkeeping is required to remove the task from the wait queue and then resumes the task.
3. The `waiter` resumes in the wait code. It may now decide how and whether to propagate the cancellation to the object it was just waiting on. Note that this may involve re-queuing a wait (to wait for the cancellation of `waitee` to complete).

The idea here is that this provides a well-defined context for cancellation-propagation logic to run. I wanted to avoid having any cancellation-propagation logic run in parallel with the actual wait code.

How the cancellation propagates is a bit of a policy question, and not one that I fully intend to address in this PR. My plan is to implement a basic state machine that works well for ^C (requesting safe cancellation immediately and then requesting increasingly unsafe modes of cancellation upon timeout or repeated ^C; also sketched below), but I anticipate that external libraries will want to create their own cancellation-request state machines, which the system supports. That implementation is incomplete, so I will not describe it here yet.

One may note that this scheme involves a significant number of additional fully dynamic dispatches (at least `waitqueue` and `cancel_wait!`, and possibly more in the future). However, these dynamic dispatches are confined to the cancellation path, which is not throughput-sensitive (but is latency-sensitive).

The handling of ^C is delegated to a dedicated task that gets notified from the signal handler when a SIGINT is received (similar to the existing profile-listener task). There is a little bit of an additional wrinkle in that we need some logic to kick a computational task out to its nearest cancellation point if we do not have any idle threads. This logic is not yet implemented.
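To make the wait-side protocol concrete, here is a minimal sketch of the three steps above, assuming an atomic `cancellation_request` task field and the `waitqueue`/`cancel_wait!` generics described in this commit; the function name, signatures, and field access are illustrative assumptions, not this PR's actual code:

```
# Hypothetical sketch of the three-step cancellation protocol; names and
# signatures are assumptions for illustration, not this PR's actual code.
function request_cancel!(waiter::Task, req)
    # Step 1: publish the cancellation request on the waiter.
    @atomic waiter.cancellation_request = req
    # Step 2: if the waiter is blocked, its former `queue` field holds a
    # generic waitee reference; ask the waitee to release the waiter.
    waitee = @atomic waiter.queue
    if waitee !== nothing
        # `cancel_wait!` performs the synchronization needed to remove the
        # waiter from `waitqueue(waitee)` and then reschedules it.
        cancel_wait!(waitee, waiter)
    end
    # Step 3 runs in the waiter itself: it resumes inside its wait code,
    # observes the pending request, and decides how and whether to propagate
    # the cancellation to `waitee`, possibly re-queuing a wait for that
    # cancellation to complete.
end
```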
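Similarly, here is a sketch of the kind of escalation state machine the dedicated ^C task could run. `CANCEL_REQUEST_SAFE` matches the REPL output below, but the other levels and the `wait_for_sigint` helper are assumptions for illustration:

```
# Hypothetical sketch of a ^C escalation policy. CANCEL_REQUEST_SAFE matches
# the REPL output below; the other levels, and `wait_for_sigint` /
# `request_cancel!`, are assumptions for illustration.
@enum CancelMode CANCEL_REQUEST_SAFE CANCEL_REQUEST_UNSAFE CANCEL_REQUEST_FORCE

function sigint_listener(target::Task)
    mode = CANCEL_REQUEST_SAFE
    while !istaskdone(target)
        wait_for_sigint()              # woken from the signal handler
        request_cancel!(target, mode)  # see the sketch above
        # Each repeated ^C escalates to a less safe cancellation mode; a real
        # implementation would also escalate on timeout.
        if mode != CANCEL_REQUEST_FORCE
            mode = CancelMode(Int(mode) + 1)
        end
    end
end
```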
Some examples of the resulting behavior:

```
julia> sleep(1000)
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./condition.jl:134 [inlined]
 [2] _trywait(t::Timer)
   @ Base ./asyncevent.jl:195
 [3] wait
   @ ./asyncevent.jl:204 [inlined]
 [4] sleep(sec::Int64)
   @ Base ./asyncevent.jl:322
 [5] top-level scope
   @ REPL[1]:1

julia> function find_collatz_counterexample()
           i = 1
           while true
               j = i
               while true
                   @Base.cancel_check
                   j = collatz(j)
                   j == 1 && break
                   j == i && error("$j is a collatz counterexample")
               end
               i += 1
           end
       end
find_collatz_counterexample (generic function with 1 method)

julia> find_collatz_counterexample()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./condition.jl:134 [inlined]
 [2] find_collatz_counterexample()
   @ Main ./REPL[2]:6
 [3] top-level scope
   @ REPL[3]:1

julia> wait(@async sleep(100))
^CERROR: TaskFailedException
Stacktrace:
 [1] wait(t::Task; throw::Bool)
   @ Base ./task.jl:367
 [2] wait(t::Task)
   @ Base ./task.jl:360
 [3] top-level scope
   @ REPL[4]:0
 [4] macro expansion
   @ task.jl:729 [inlined]

    nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
    Stacktrace:
     [1] macro expansion
       @ ./condition.jl:134 [inlined]
     [2] _trywait(t::Timer)
       @ Base ./asyncevent.jl:195
     [3] wait
       @ ./asyncevent.jl:204 [inlined]
     [4] sleep
       @ ./asyncevent.jl:322 [inlined]
     [5] (::var"#2#3")()
       @ Main ./REPL[4]:1

julia> @sync begin
           @async sleep(100)
           @async find_collatz_counterexample()
       end
^CERROR: nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
 [1] macro expansion
   @ ./task.jl:1234 [inlined]
 [2] _trywait(t::Timer)
   @ Base ~/julia-cancel/usr/share/julia/base/asyncevent.jl:195
 [3] wait
   @ ./asyncevent.jl:203 [inlined]
 [4] sleep
   @ ./asyncevent.jl:321 [inlined]
 [5] (::var"#45#46")()
   @ Main ./REPL[26]:3

...and 1 more exception.

Stacktrace:
 [1] sync_cancel!(c::Channel{Any}, t::Task, cr::Any, c_ex::CompositeException)
   @ Base ~/julia-cancel/usr/share/julia/base/task.jl:1454
 [2] sync_end(c::Channel{Any})
   @ Base ~/julia-cancel/usr/share/julia/base/task.jl:608
 [3] macro expansion
   @ ./task.jl:663 [inlined]
 [4] (::var"#43#44")()
   @ Main ./REPL[5]
```

As noted above, `@Base.cancel_check` is not intended to be required in the inner loop. Rather, the compiler is expected to extend the cancellation point from the start of the loop to the entire function. However, this is not yet implemented.