Backport cog-runtime (#2583)

Commit

52 days ago

* initial commit of imported cog-runtime repo

* Integrate cog-runtime (coglet) into main repository

Move coglet runtime code from standalone cog-runtime repository into coglet/ directory:
- coglet/cmd/coglet-server/ - Go binary for container runtime
- coglet/internal/ - Go server packages (runner, server, webhook, etc.)
- coglet/python/coglet/ - Python SDK
- coglet/python/cog/ - Compatibility shim
- coglet/pyproject.toml - Python package config

Changes:
- Bump go.mod to go 1.25 (required for sync.WaitGroup.Go, os.Root features)
- Add Makefile targets: coglet-wheel, coglet-server-binaries, test-coglet-*
- Add CI jobs for coglet Go and Python tests
- Update import paths from github.com/replicate/cog-runtime/internal/* to github.com/replicate/cog/coglet/internal/*

Original code from: https://github.com/replicate/cog-runtime

Phase 1 of cog-runtime integration (cog-l09 epic, cog-4gd task)

* fix: remove readme field from coglet pyproject.toml

The readme = '../README.md' path breaks CI because setuptools doesn't allow referencing files outside the package directory.

* feat: use uv run for coglet Go tests instead of manual venv

Replace PythonBinPath config with PythonCommand []string to support flexible Python invocation. This allows tests to use 'uv run --project' which automatically manages the Python environment.

Changes:
- config: Replace PythonBinPath string with PythonCommand []string
- manager: Add buildPythonCmd() helper for command construction
- harness_test: Use 'uv run --project' for both coglet and legacy cog tests
- Remove manual .venv path construction and PYTHONPATH manipulation
- Add coglet/uv.lock for reproducible test environments
- Ignore auto-generated _version.py files from setuptools-scm

This eliminates the need for manual venv setup before running tests.

* fix: pre-create coglet venv in CI to avoid parallel uv sync

The coglet Go tests use 'uv run --project coglet' which creates a venv on first run. When multiple tests start in parallel, they may all try to create the venv simultaneously, causing hangs or timeouts. Pre-running 'uv sync' ensures the venv exists before tests start.

* fix: exclude coglet/ from test-go target to avoid CI hangs

The test-go CI job was including coglet/internal/tests which requires uv to be set up for Python environment management. Since test-coglet-go already runs these tests with proper uv setup, exclude all coglet/ packages from test-go to avoid 20-minute hangs in CI.

* fix: clean up pending map before sending webhook to avoid race condition

When a prediction completes, the terminal webhook was sent before the pending map entry was deleted. This caused a race condition where a webhook receiver starting a new prediction would see the runner as having no capacity (pending entry still exists), leading to 500 errors in sequential prediction scenarios.

Reorder operations to delete from pending map first, then send webhook. This ensures findRunnerWithCapacity sees accurate capacity when new predictions arrive.

* fix: limit coglet test parallelism to avoid resource exhaustion

The coglet tests spawn Python subprocesses for each test case. Running too many in parallel causes resource exhaustion (OOM kill) in CI. Limit parallelism to 4 to prevent this.

* fix: resolve golangci-lint errors in coglet code

- Add gosec nolint directives for trusted subprocess and HTTP calls
- Refactor handlePath() to use type switch instead of if-else chain
- Fix regex match validation to check len() instead of nil

* fix: prevent test cleanup from killing test process group

The killAllChildProcesses() function was using pgrep -f "coglet" which matched the test binary path itself. When combined with syscall.Kill(-pid, SIGKILL) which kills entire process groups, this was terminating the test process during cleanup.

Added checks to:
- Skip processes in the same process group as the test
- Skip the parent process (ourPpid)

Also restored the original gotestsum test runner format.

* fix: add CodeQL security annotations for coglet

Add #nosec annotations to suppress CodeQL false positives for:
- G304 (path traversal): procedure.go:62, runner.go:866, runner.go:873
- G107 (SSRF): procedure.go:79, webhook.go:57

These are intentional behaviors in the coglet runtime which runs in isolated containers with inputs from trusted orchestration systems. Added TODO[md] comments for future validation improvements.

* chore: simplify SSRF TODO annotations in coglet

Remove #nosec annotations (CodeQL doesn't use them) and simplify TODO comments for future SSRF protection work. These alerts will be dismissed in GitHub UI since URLs come from trusted orchestration layer.

* fix: use os.Root API for traversal-safe file operations in coglet

Refactor coglet path operations to use Go 1.24's os.Root API to prevent path traversal attacks. This addresses CodeQL path injection alerts.

Changes:
- Add workingRoot field to RunnerContext for scoped file operations
- Add WriteFile/StatFile helper methods with fallback for tests
- Initialize workingRoot in manager.go when creating runners
- Close workingRoot in RunnerContext.Cleanup()
- Add path validation for file:// URLs in procedure.go using filepath.Clean and filepath.EvalSymlinks

The os.Root API ensures file operations cannot escape the working directory even with malicious path inputs like '../../../etc/passwd'.

* Revert "fix: use os.Root API for traversal-safe file operations in coglet"

This reverts commit ac6e3ba5413a424cdd5fc12b8b057849a1c32607.

* fix: add CodeQL suppression comments and config for path injection alerts

Add inline suppression comments with detailed rationale for known false positive path injection alerts in coglet:
- runner.go: requestPath is constructed from controlled workingdir and internally-generated prediction ID (not user input)
- procedure.go: file:// URLs only used in dev/testing, production uses http/https from trusted sources

Also adds .github/codeql-config.yml documenting these decisions and excluding test fixtures from analysis.

* Revert "fix: add CodeQL suppression comments and config for path injection alerts"

This reverts commit 798c412e348f7d407f8c376d3008949cf938e456.

---------

Co-authored-by: Michael Dwan <mdwan@cloudflare.com>
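The pending-map fix described in the commit message comes down to an ordering rule: free the slot first, then notify. The following is a minimal sketch of that ordering, not the coglet implementation. The Runner type, its fields, and the Start, HasCapacity, and Complete methods are hypothetical names chosen for illustration; only the "delete from pending, then send webhook" sequence comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// Runner is a hypothetical stand-in for a coglet runner. It tracks
// in-flight predictions in a pending map guarded by a mutex.
type Runner struct {
	mu       sync.Mutex
	pending  map[string]struct{}
	maxSlots int
}

// NewRunner returns a runner that accepts at most maxSlots concurrent predictions.
func NewRunner(maxSlots int) *Runner {
	return &Runner{pending: make(map[string]struct{}), maxSlots: maxSlots}
}

// HasCapacity reports whether another prediction could be accepted right now.
func (r *Runner) HasCapacity() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.pending) < r.maxSlots
}

// Start registers a prediction, or fails if every slot is taken.
func (r *Runner) Start(id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.pending) >= r.maxSlots {
		return fmt.Errorf("no capacity for prediction %s", id)
	}
	r.pending[id] = struct{}{}
	return nil
}

// Complete frees the slot first and only then sends the terminal webhook.
// The lock is released before the callback runs, so a receiver that
// immediately starts a new prediction sees the freed slot.
func (r *Runner) Complete(id string, sendWebhook func(id string)) {
	r.mu.Lock()
	delete(r.pending, id)
	r.mu.Unlock()

	sendWebhook(id)
}

func main() {
	r := NewRunner(1)
	if err := r.Start("p1"); err != nil {
		fmt.Println("unexpected:", err)
		return
	}

	// The callback plays the role of a webhook receiver that reacts to the
	// terminal event by submitting the next prediction.
	r.Complete("p1", func(id string) {
		if err := r.Start("p2"); err != nil {
			fmt.Println("rejected:", err)
			return
		}
		fmt.Println("accepted p2 after", id)
	})
}
```

If the two steps in Complete were swapped, the callback would run while "p1" still occupied the only slot, and Start("p2") would fail. That matches the sequential-prediction failure the commit message describes.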

References

#2583 - Backport cog-runtime

Author

michaeldwan

Parents

27e8ffc5

cog 3b18cc4e - Backport cog-runtime (#2583)
