cog
19995330 - Fix: cancellation wiring (#2772)

Commit
14 days ago
Fix: cancellation wiring (#2772)

* Wire cancellation end-to-end from HTTP to worker subprocess

Cancel was accepted by the HTTP layer but never reached the worker. This wires the full path: HTTP cancel/connection-drop → supervisor → orchestrator → ControlRequest::Cancel → worker → PyThreadState_SetAsyncExc.

Key changes:
- Orchestrator: add cancel_by_prediction_id (resolves pred ID → slot ID)
- Supervisor: cancel() delegates to orchestrator via spawned task
- SyncPredictionGuard: calls supervisor.cancel() on drop (not just token)
- Sync HTTP handler: spawns prediction in background task so slot lifetime is not tied to the HTTP connection (fixes permit leak on disconnect)
- cancel.rs: replace SIGUSR1 with PyThreadState_SetAsyncExc for sync cancel (works on any thread, not just main); fix fallback exception to create a proper class, not an instance
- worker_bridge: capture py_thread_id at prediction start, use it for cancel_sync_thread(); async cancel still uses future.cancel()
- Tests: 5 new Python cancel tests (explicit cancel sync/async, connection-drop), 2 txtar integration tests

* Fix sync cancel status: upgrade PyErr to Cancelled when slot is marked cancelled

PyThreadState_SetAsyncExc injects CancelationException into the Python thread, but predict_worker catches it as a generic PyErr and returns PredictionError::Failed. After predict_worker returns, check if the slot was marked cancelled and upgrade to PredictionError::Cancelled. Also applies the same fix to the sync train path.

* Make CancelationException derive from BaseException and expose via cog SDK

CancelationException now derives from BaseException (not Exception) so that bare `except Exception` blocks in user predict code cannot swallow cancellation, matching KeyboardInterrupt and CancelledError.

Key changes:
- cancel.rs: use pyo3_stub_gen::create_exception! to define a static BaseException subclass, replacing the dynamic builtins.type() approach and the OnceLock<Py<PyAny>> storage
- lib.rs: register CancelationException on the coglet._impl module
- coglet/__init__.py: re-export CancelationException
- cog/exceptions.py: new module re-exporting from coglet with a pure-Python fallback for SDK-only environments
- cog/__init__.py: expose as cog.CancelationException
- stub_gen.rs: include CancelationException in public re-exports
- Regenerated stubs via mise run generate:stubs

* Remove dead SIGUSR1 cancellation handler and CANCELABLE flag

PyThreadState_SetAsyncExc replaced SIGUSR1 as the sync cancel mechanism but the old handler, CANCELABLE flag, CancelableGuard, and enter_cancelable() were left behind. Nothing sends SIGUSR1 and nothing reads the CANCELABLE flag on the cancel path.

Removed:
- _sigusr1_handler, install_signal_handler, CANCELABLE, CancelableGuard, enter_cancelable, is_cancelable from cancel.rs
- enter_cancelable() calls from worker_bridge.rs and predictor.rs
- install_signal_handler() call from lib.rs worker init
- _is_cancelable Python getter from Server

Also updated is_cancelation_exception in predictor.rs to use the static cancel::CancelationException type instead of a dynamic import from the nonexistent cog.server.exceptions module.

* Simplify cog.exceptions: drop fallback class, import directly from coglet

The pure-Python fallback CancelationException conflicted with pyright when coglet was installed (two incompatible types with the same name). Since coglet is always present at runtime, just re-export directly.

* Document CancelationException in Python SDK and HTTP reference

- python.md: add Cancellation section with CancelationException usage, import paths, BaseException semantics, and re-raise requirement
- http.md: update cancel endpoint docs to use cog.CancelationException import path (was cog.server.exceptions) and link to new SDK docs
- Regenerated llms.txt

* Clarify CancelationException is for sync predictors only

Async predictors receive asyncio.CancelledError instead. Added a table to python.md making the distinction clear, and updated http.md to mention both exception types.

* CI: make lint-python use local coglet wheel when available

The typecheck in lint-python needs the coglet wheel to resolve CancelationException (now exported from coglet). Without the local wheel, nox falls back to PyPI which doesn't have unreleased changes. Adds build-rust as an optional dependency and downloads the CogletRustWheel artifact when the rust build succeeded, matching the pattern already used by test-coglet-python.

* Remove mypy from stub generation: stub_gen.rs handles everything

stub_gen.rs now generates coglet/__init__.pyi directly instead of having mypy stubgen overwrite it. Private submodules like _sdk use relative imports (from . import) so type checkers resolve them via the filesystem rather than through the native _impl module.

* Add tests for repeated cancellation idempotency

Verify that cancelling the same prediction multiple times doesn't panic or break the server. Covers both busy-loop (immediate cancel at bytecode boundaries) and time.sleep/nanosleep (cancel deferred until the C-level block returns).

- Rust unit test: supervisor repeated_cancel_is_idempotent
- Python test: parametrized over busy_loop and nanosleep variants
- Integration test: repeated cancel with time.sleep predictor

* Fix build:cog to produce dev version for local wheel auto-detection

The build:cog task was setting COG_VERSION to the clean Cargo.toml version (e.g. 0.17.0), which caused isDev=false and skipped local wheel auto-detection in dist/. Remove the override so goreleaser's snapshot template produces a proper dev version (e.g. 0.17.1-dev+g<sha>).

* Fix rustfmt formatting in supervisor test
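
The sync cancel mechanism described above can be illustrated in pure Python. This is a minimal sketch, not the code from this commit: it assumes a local stand-in CancelationException class (the real one is defined in Rust and exposed as cog.CancelationException), and the Python helper cancel_sync_thread, predict, worker_main, and run_demo are hypothetical names used only for illustration. It calls the CPython C API function PyThreadState_SetAsyncExc through ctypes to show two points from the message: the exception can be injected into any thread, and because it derives from BaseException, an `except Exception` block in user code does not swallow it.

```python
import ctypes
import threading
import time


class CancelationException(BaseException):
    """Local stand-in (assumption) for cog.CancelationException."""


def cancel_sync_thread(thread_id: int) -> bool:
    """Hypothetical helper: ask CPython to raise CancelationException in a thread.

    PyThreadState_SetAsyncExc returns the number of thread states modified,
    so 1 means the request was registered for exactly one thread.
    """
    modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(CancelationException)
    )
    return modified == 1


def predict(started: threading.Event) -> None:
    """Simulated user predict code with a broad `except Exception` handler."""
    try:
        started.set()
        while True:
            try:
                time.sleep(0.01)  # delivery is deferred until the sleep returns
            except Exception:
                pass  # does not catch BaseException subclasses
    except CancelationException:
        # Cleanup would go here; re-raise so the runtime sees the cancellation.
        raise


def worker_main(started: threading.Event, outcome: dict) -> None:
    """Simulated runtime wrapper that maps the exception to a status."""
    try:
        predict(started)
    except CancelationException:
        outcome["status"] = "canceled"


def run_demo() -> str:
    started = threading.Event()
    outcome: dict = {}
    worker = threading.Thread(
        target=worker_main, args=(started, outcome), daemon=True
    )
    worker.start()
    started.wait()  # the worker is now inside predict's try block
    cancel_sync_thread(worker.ident)
    worker.join(timeout=5)
    return outcome.get("status", "not canceled")


if __name__ == "__main__":
    print(run_demo())
```

If the stand-in class derived from Exception instead, the inner handler in predict would swallow the injected exception and the loop would keep running, which is the failure mode the BaseException change is meant to prevent.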