Fix: cancellation wiring (#2772)
* Wire cancellation end-to-end from HTTP to worker subprocess
Cancel requests were accepted by the HTTP layer but never reached the worker.
This wires the full path: HTTP cancel/connection-drop → supervisor →
orchestrator → ControlRequest::Cancel → worker → PyThreadState_SetAsyncExc.
Key changes:
- Orchestrator: add cancel_by_prediction_id (resolves pred ID → slot ID)
- Supervisor: cancel() delegates to orchestrator via spawned task
- SyncPredictionGuard: calls supervisor.cancel() on drop (not just token)
- Sync HTTP handler: spawns prediction in background task so slot lifetime
  is not tied to the HTTP connection (fixes permit leak on disconnect)
- cancel.rs: replace SIGUSR1 with PyThreadState_SetAsyncExc for sync
  cancel (works on any thread, not just main); fix fallback exception
  to create a proper class, not an instance
- worker_bridge: capture py_thread_id at prediction start, use it for
  cancel_sync_thread(); async cancel still uses future.cancel()
- Tests: 5 new Python cancel tests (explicit cancel sync/async,
  connection-drop), 2 txtar integration tests
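A minimal Python sketch of the sync cancel mechanism described above, assuming the worker records the Python thread ID when a prediction starts. The real implementation lives in Rust (cancel.rs); here `ctypes.pythonapi` stands in for the C API call, and `CancelationException`, `cancel_sync_thread`, and `run_demo` are illustrative stand-ins rather than the project's actual code.

```python
import ctypes
import threading


class CancelationException(BaseException):
    """Stand-in for the real exception type (illustration only)."""


def cancel_sync_thread(thread_id: int) -> bool:
    """Ask CPython to raise CancelationException in the given thread.

    The exception must be passed as a class, not an instance.
    Returns True if exactly one thread state was modified.
    """
    modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(CancelationException)
    )
    if modified > 1:
        # More than one thread state was affected: clear the pending exception.
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_ulong(thread_id), None)
        return False
    return modified == 1


def run_demo() -> str:
    result = {}
    started = threading.Event()

    def fake_predict():
        # Captured at prediction start, like py_thread_id in worker_bridge.
        result["thread_id"] = threading.get_ident()
        try:
            started.set()
            counter = 0
            while True:  # busy loop; the exception lands at a bytecode boundary
                counter += 1
        except CancelationException:
            result["status"] = "cancelled"

    worker = threading.Thread(target=fake_predict, daemon=True)
    worker.start()
    started.wait(timeout=5)
    cancel_sync_thread(result["thread_id"])
    worker.join(timeout=5)
    return result.get("status", "still running")
```

Unlike a signal handler, this targets a specific thread by ID, so the predict function does not need to run on the main thread.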
* Fix sync cancel status: upgrade PyErr to Cancelled when slot is marked cancelled
PyThreadState_SetAsyncExc injects CancelationException into the Python
thread, but predict_worker catches it as a generic PyErr and returns
PredictionError::Failed. After predict_worker returns, check if the
slot was marked cancelled and upgrade to PredictionError::Cancelled.
Also applies the same fix to the sync train path.
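The status upgrade can be sketched in a few lines of Python. The actual logic is in Rust and operates on PredictionError variants. The `Outcome` enum and `finalize_outcome` helper below are hypothetical names used only to show the decision rule.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical mirror of the result states mentioned in this commit.
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


def finalize_outcome(raw: Outcome, slot_cancelled: bool) -> Outcome:
    """Upgrade a generic failure to a cancellation when the slot was cancelled.

    An injected cancellation exception surfaces as an ordinary Python error,
    so the slot's cancelled flag is the only reliable signal.
    """
    if raw is Outcome.FAILED and slot_cancelled:
        return Outcome.CANCELLED
    return raw
```

A successful result is left alone even if a cancel arrived late, and a failure without a cancel request stays a failure.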
* Make CancelationException derive from BaseException and expose via cog SDK
CancelationException now derives from BaseException (not Exception) so
that broad `except Exception` blocks in user predict code cannot
swallow cancellation, matching KeyboardInterrupt and asyncio.CancelledError.
Key changes:
- cancel.rs: use pyo3_stub_gen::create_exception! to define a static
  BaseException subclass, replacing the dynamic builtins.type() approach
  and the OnceLock<Py<PyAny>> storage
- lib.rs: register CancelationException on the coglet._impl module
- coglet/__init__.py: re-export CancelationException
- cog/exceptions.py: new module re-exporting from coglet with a
  pure-Python fallback for SDK-only environments
- cog/__init__.py: expose as cog.CancelationException
- stub_gen.rs: include CancelationException in public re-exports
- Regenerated stubs via mise run generate:stubs
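A small Python illustration of why the base class matters. The `CancelationException` class here is a local stand-in with the same base class as the real one, and `careless_predict` and `run` are hypothetical user code.

```python
class CancelationException(BaseException):
    """Local stand-in; the real class is defined in Rust and exported by coglet."""


def careless_predict(raise_cancel: bool) -> str:
    try:
        if raise_cancel:
            raise CancelationException()
        raise ValueError("bad input")
    except Exception:
        # Catches ValueError, but not BaseException-only subclasses.
        return "swallowed"


def run(raise_cancel: bool) -> str:
    try:
        return careless_predict(raise_cancel)
    except CancelationException:
        return "cancelled"
```

With an Exception base class, `run(True)` would return "swallowed" and the cancel would be lost inside user code.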
* Remove dead SIGUSR1 cancellation handler and CANCELABLE flag
PyThreadState_SetAsyncExc replaced SIGUSR1 as the sync cancel mechanism
but the old handler, CANCELABLE flag, CancelableGuard, and
enter_cancelable() were left behind. Nothing sends SIGUSR1 and nothing
reads the CANCELABLE flag on the cancel path.
Removed:
- _sigusr1_handler, install_signal_handler, CANCELABLE, CancelableGuard,
  enter_cancelable, is_cancelable from cancel.rs
- enter_cancelable() calls from worker_bridge.rs and predictor.rs
- install_signal_handler() call from lib.rs worker init
- _is_cancelable Python getter from Server
Also updated is_cancelation_exception in predictor.rs to use the static
cancel::CancelationException type instead of a dynamic import from the
nonexistent cog.server.exceptions module.
* Simplify cog.exceptions: drop fallback class, import directly from coglet
The pure-Python fallback CancelationException conflicted with pyright
when coglet was installed (two incompatible types with the same name).
Since coglet is always present at runtime, just re-export directly.
* Document CancelationException in Python SDK and HTTP reference
- python.md: add Cancellation section with CancelationException usage,
  import paths, BaseException semantics, and re-raise requirement
- http.md: update cancel endpoint docs to use cog.CancelationException
  import path (was cog.server.exceptions) and link to new SDK docs
- Regenerated llms.txt
* Clarify CancelationException is for sync predictors only
Async predictors receive asyncio.CancelledError instead. Added a table
to python.md making the distinction clear, and updated http.md to
mention both exception types.
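For the async side, a minimal sketch using only the standard library. `async_predict` is a hypothetical predictor body; the point is that cancelling the task raises asyncio.CancelledError inside it, which should be re-raised after cleanup.

```python
import asyncio


async def async_predict(log: list) -> None:
    try:
        await asyncio.sleep(60)  # stands in for long-running async work
    except asyncio.CancelledError:
        log.append("cleanup")
        raise  # re-raise so the caller sees the cancellation


async def main() -> list:
    log = []
    task = asyncio.create_task(async_predict(log))
    await asyncio.sleep(0)  # yield once so the task reaches its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log


def run_async_demo() -> list:
    return asyncio.run(main())
```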
* CI: make lint-python use local coglet wheel when available
The typecheck in lint-python needs the coglet wheel to resolve
CancelationException (now exported from coglet). Without the local
wheel, nox falls back to PyPI, which doesn't have unreleased changes.
Adds build-rust as an optional dependency and downloads the
CogletRustWheel artifact when the Rust build succeeds, matching
the pattern already used by test-coglet-python.
* Remove mypy from stub generation — stub_gen.rs handles everything
stub_gen.rs now generates coglet/__init__.pyi directly instead of
having mypy stubgen overwrite it. Private submodules like _sdk use
relative imports (from . import) so type checkers resolve them via
the filesystem rather than through the native _impl module.
* Add tests for repeated cancellation idempotency
Verify that cancelling the same prediction multiple times doesn't
panic or break the server. Covers both a busy loop (cancel delivered
immediately at a bytecode boundary) and time.sleep/nanosleep (cancel
deferred until the blocking C call returns).
- Rust unit test: supervisor repeated_cancel_is_idempotent
- Python test: parametrized over busy_loop and nanosleep variants
- Integration test: repeated cancel with time.sleep predictor
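The property under test can be illustrated with a toy model. `ToySupervisor` below is hypothetical and much simpler than the Rust supervisor; it only shows what "idempotent" means here: repeated cancels of the same prediction succeed without error and leave the same state.

```python
import threading


class ToySupervisor:
    """Hypothetical model: one cancellation token per prediction ID."""

    def __init__(self) -> None:
        self._tokens = {}

    def register(self, prediction_id: str) -> None:
        self._tokens[prediction_id] = threading.Event()

    def cancel(self, prediction_id: str) -> bool:
        token = self._tokens.get(prediction_id)
        if token is None:
            return False  # unknown prediction: nothing to cancel
        token.set()  # setting an already-set Event is a no-op
        return True

    def is_cancelled(self, prediction_id: str) -> bool:
        token = self._tokens.get(prediction_id)
        return token is not None and token.is_set()
```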
* Fix build:cog to produce dev version for local wheel auto-detection
The build:cog task was setting COG_VERSION to the clean Cargo.toml
version (e.g. 0.17.0), which caused isDev=false and skipped local
wheel auto-detection in dist/. Remove the override so goreleaser's
snapshot template produces a proper dev version (e.g. 0.17.1-dev+g<sha>).
* Fix rustfmt formatting in supervisor test