cog
cac3e91a - feat: experimental managed weights (#2974)

Commit
24 days ago
feat: experimental managed weights (#2974)

* feat(weights): managed weights pipeline (config, OCI bundles, store, CLI)

Adds the v1 managed-weights feature:

- pkg/config: `weights` stanza in cog.yaml + JSON schema validation
- pkg/model/weightsource: pluggable Source interface with file:// and hf:// scheme implementations, include/exclude filters, content-addressed fingerprinting
- pkg/model: weight manifest v1 format, OCI bundle index assembly, packer plan/execute split, weight pusher with progress reporting
- pkg/weights: WeightStore interface, FileStore (hardlinked extraction cache), WeightManager, parallel layer puller, mount helper for predict
- pkg/weights/lockfile: weights.lock parsing, generation, source fingerprint
- pkg/paths: cache-dir resolver
- pkg/cli/weights: hidden `cog weights` command group with import, pull, status subcommands
- pkg/cli/predict: auto-pull and mount managed weights at predict time
- pkg/cli/{build,push}: emit weight manifests as part of the OCI bundle
- pkg/predict: wire WeightManager into the predictor

Tooling additions:

- go.mod: hashicorp/go-retryablehttp for resilient pulls
- mise.toml: goimports + rust-analyzer
- .gitignore: drop legacy weights-gen ignore lines

The `cog weights` command group is hidden, and the bundle format only activates when cog.yaml declares a `weights` stanza, so existing models are unaffected.
* test(weights): integration tests and managed-weights example

- integration-tests/tests/weights_*.txtar: end-to-end coverage for cog weights import, pull, predict, and include/exclude filters
- integration-tests/tests/oci_bundle_*.txtar: bundle build/push/inspect flows updated for the v1 weight manifest format
- integration-tests/harness: helpers for the new tests
- examples/managed-weights: working reference model with cog.yaml, predict.py, README, and a checked-in weights.lock for verifying the pull/predict round trip

* docs(spec): add internal draft spec for managed weights

specs/draft-weights.md is an internal design document for the OCI weight bundle format. Lives under specs/ rather than docs/ so it is not part of the published documentation site. `draft-` prefix signals it is not yet a stable reference.

* refactor(weights): simplify review cleanup

- Parallelize computeLayerDigests and computeInventory file hashing with errgroup (bounded by GOMAXPROCS)
- Type WeightStatus/LayerStatus as string enums for compile-time safety
- Deduplicate formatDigestShort by delegating to model.ShortDigest
- Remove HasProblems() wrapper, inline !AllReady() at call site

* refactor(weights): remove cog inspect command

The hidden cog inspect command was added during early managed-weights development to debug bundle pushes. It was never designed for general release: no docs, no tests beyond a single integration script, and the output format (text and JSON variants, with a separate raw streaming mode) accreted features without a coherent target audience.

Removing it lets us also drop pkg/model/index.go (the Index/IndexManifest parse types only inspect consumed) and a small block in resolver.go that populated those types when loading bundles. The push-side IndexBuilder in index_factory.go is unrelated and stays.

PlatformUnknown moves next to Platform in artifact_image.go, where it belongs structurally. TestModel_IsBundle moves to model_test.go since index_test.go went away.
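The bounded parallel hashing mentioned in the review-cleanup item above can be sketched roughly as follows. This is not the code from computeLayerDigests or computeInventory: the commit uses errgroup, while this sketch swaps in a sync.WaitGroup plus a buffered-channel semaphore so it needs only the standard library. The names hashFiles and sha256File are hypothetical here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"runtime"
	"sync"
)

// sha256File streams one file through SHA-256 and returns "sha256:<hex>".
func sha256File(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return "sha256:" + hex.EncodeToString(h.Sum(nil)), nil
}

// hashFiles (hypothetical name) hashes paths concurrently with at most
// GOMAXPROCS files in flight and reports the first error it recorded.
func hashFiles(paths []string) (map[string]string, error) {
	sem := make(chan struct{}, runtime.GOMAXPROCS(0))
	digests := make(map[string]string, len(paths))
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	for _, p := range paths {
		wg.Add(1)
		sem <- struct{}{} // blocks once the concurrency limit is reached
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next file
			d, err := sha256File(p)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("hash %s: %w", p, err)
				}
				return
			}
			digests[p] = d
		}(p)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return digests, nil
}

func main() {
	dir, err := os.MkdirTemp("", "hash-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	var paths []string
	for i, body := range []string{"alpha", "beta", "gamma"} {
		p := filepath.Join(dir, fmt.Sprintf("shard-%d.bin", i))
		if err := os.WriteFile(p, []byte(body), 0o644); err != nil {
			panic(err)
		}
		paths = append(paths, p)
	}
	digests, err := hashFiles(paths)
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		fmt.Println(filepath.Base(p), digests[p])
	}
}
```

Unlike errgroup with a context, this sketch does not cancel hashes that are already running when one fails; it only records the first error and discards the partial result.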
* test(weights): drop integration tests for removed commands

Three integration tests asserted on the v0 cog weights {build,push,inspect} verbs and v0 lockfile schema (dest, digestOriginal fields). Those commands and that schema are gone in the current managed-weights design; the tests have been failing in CI on this branch since the v1 rewrite landed.

- weights_build.txtar: tested cog weights build standalone. v1 has no separate build step (the lockfile is generated as part of import); the schema fields it asserted on no longer exist.
- weights_push_inspect.txtar: tested the build → push → inspect lifecycle end-to-end. v1 collapses build+push into cog weights import (covered by oci_bundle_push.txtar) and replaces inspect with status (covered by weights_status unit tests).
- oci_bundle_build.txtar: tested cog predict against a pre-built bundled image. predict.go:193-195 documents that pre-built images are now opaque to Cog at predict time; that path is explicitly out of scope for v1.

One real coverage gap surfaces: parseRepoOnly's rejection of tagged references in cog weights import is no longer asserted anywhere. Worth adding a small test for it later, but unrelated to this cleanup.

* test(weights): remove unused mock-weights testscript helper

cmdMockWeights generated N random files plus a synthetic weights.lock, and was registered as the mock-weights testscript command. Originally useful when weights were single random files, but the v1 design treats weights more like Hugging Face repos: one or two large shards alongside several small JSON/text files. A flat 'N random files of size X' fixture no longer represents the shape we want to test against, and no existing .txtar invokes mock-weights; every weights-related test seeds files inline with dd/printf/echo.

If we later need realistic fixtures, the right shape is a templated hf-style directory with configurable padding for the large files, not a refactor of this helper.
Dropping the dead code now to keep the harness honest about what it actually provides.

* style: gofmt fixes

Apply gofmt across files that drifted out of canonical formatting. Mechanical changes only (whitespace, struct tag spacing, table-driven test alignment). No behavior change. CI's Format Go job was rejecting these on the branch.

* test(weights): repair integration tests broken by v1 design changes

Five integration tests were failing in CI for substantive reasons (not just the dead commands cleaned up earlier). Each one tripped on a specific v1 design choice:

- weights_import_predict, weights_pull_predict, weights_filter: the cog.yaml templates carried a placeholder image: line, with the test also appending the real image: via printf at runtime. Two image: keys break YAML parsing. Drop the placeholder; the printf writes the only one.
- weights_pull, weights_pull_predict: assumed cog weights import leaves the local cache cold so the subsequent cog weights pull has work to do. v1's import warms the cache as a side effect (the cog-i12u guarantee), so pull becomes a no-op. Purge the cache between import and pull to simulate the realistic 'lockfile checked in, fresh clone, cold cache' scenario these tests want to cover.
- oci_bundle_push: three issues. (1) source: weights-alpha used the v0 string schema; v1 requires source.uri with a file:// URI. (2) cog push no longer runs the weight builder implicitly; it requires weights.lock to already exist, so add an explicit cog weights import step. (3) the run.cog.reference.type annotation assertion was for an annotation v1 doesn't emit; replace with run.cog.weight.set-digest, which v1 does emit on each weight descriptor.

All five verified passing locally with the just-built dist binary against a real local registry. None of these failures were introduced by recent slices; they have been failing on the branch since the v1 rewrite landed in 07db2579.
* fix: address security and correctness issues from PR review

- FileSource.Open: use os.DirFS to prevent path traversal at the FS boundary instead of relying on upstream validation
- FileStore.PutFile: validate expectedSize against bytes written to detect truncation/corruption
- HFSource: build URLs via path.Join + url.URL instead of fmt.Sprintf string interpolation; parse baseURL once at construction
- writeLayer: check context cancellation before directory header writes

* fix: prune orphaned entries from weights.lock when config changes

Removing a weight from cog.yaml left its entry in weights.lock. Since the lockfile is projected into /.cog/weights.json (the runtime manifest), coglet expected weights that no longer exist.

Add Retain() and PruneLockfile() to the lockfile package. Prune runs during cog weights import (always, using all config names) and in Resolver.Build() before the image build so writeRuntimeWeightsManifest sees a clean lockfile.

Also adds config.WeightNames() helper to avoid duplicating the name-extraction loop across call sites.

* chore(examples): remove orphaned qwen weight from managed-weights example

The qwen3.6-27b-fp8 weight was commented out in cog.yaml but its entry (~30GB of layer metadata) persisted in weights.lock. This was the bug that motivated the pruning fix. Clean up the example to reflect the actual config.

* feat: detect weights.lock drift and fail build/push/predict/train early

build, push, predict, and train now compare cog.yaml weight declarations against weights.lock before invoking resolver.Build(). If the lockfile is stale, missing, or has orphaned entries the command fails immediately with a clear message listing each mismatch and directing the user to run 'cog weights import'.
The drift detection is split into two layers:

- lockfile.CheckDrift: pure comparison, no I/O, lives in pkg/weights/lockfile
- weights.CheckDrift: loads lockfile, normalizes config, formats errors

* feat: make Resolver.Build read-only, HEAD-check weights during push

Resolver.Build() no longer opens the weight store, ingresses files, or writes weights.lock. It loads lockfile entries into a new model.Weight type -- the model's lightweight representation of a managed weight.

BundlePusher.Push() no longer uploads weight layers. It HEAD-checks each weight manifest by tag (pushed earlier by 'cog weights import'), gets the descriptor, and assembles the OCI index. Image and weight HEAD-checks run concurrently.

WeightArtifact/WeightBuilder/WeightPusher remain for the import path.

Also removes dead fields: PushOptions.WeightProgressFn, BuildOptions.WeightsLockPath.

* chore(examples): add directory listing to predict.py setup for debugging

* fix: URL-escape path segments in HuggingFace URL builder

buildURLWithQuery now percent-encodes each path component individually so filenames containing #, %, spaces, or other URL-special characters produce valid URLs. Uses url.URL.RawPath to avoid double-encoding.

Resolves ask-bonk review feedback on PR #2974.

* chore: fix formatting

* fix: remove stale 'Pushing weights' assertion from oci_bundle_push test

Weights are pushed exclusively via 'cog weights import', not during 'cog push'. The push flow only HEAD-checks that weight manifests exist.

* feat(examples): add resnet50 example with HuggingFace managed weights

* fix: replace removed util.SHA256HashFile with local weightsource helper

The deadcode cleanup on main deleted pkg/util/hash.go, breaking setdigest.go which was the only consumer. Move the SHA-256 file hashing into weightsource as a package-private sha256File() that returns the "sha256:<hex>" digest directly.

Also fix ruff lint errors in the resnet example (import order, return annotation, zip strict).
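For the URL-escaping fix above, a minimal sketch of per-segment percent-encoding with url.URL.RawPath might look like this. It is not the actual buildURLWithQuery: the urlBuilder type, newURLBuilder, fileURL, and the hub.example.invalid host are hypothetical, and query handling is left out. It does follow the two ideas the commits name: parse the base URL once at construction, and set Path and RawPath together so the encoded form is emitted as-is rather than re-encoded.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// urlBuilder (hypothetical type) holds a base URL parsed once up front.
type urlBuilder struct {
	base *url.URL
}

func newURLBuilder(rawBase string) (*urlBuilder, error) {
	base, err := url.Parse(rawBase)
	if err != nil {
		return nil, fmt.Errorf("parse base URL: %w", err)
	}
	return &urlBuilder{base: base}, nil
}

// fileURL appends segments to the base path, percent-encoding each segment
// on its own so characters such as '#', '%', and spaces stay in the path.
func (b *urlBuilder) fileURL(segments ...string) string {
	escaped := make([]string, len(segments))
	for i, s := range segments {
		escaped[i] = url.PathEscape(s)
	}
	u := *b.base // shallow copy so the shared base URL is not mutated
	// Path carries the decoded form and RawPath the encoded form. Because
	// RawPath is a valid encoding of Path, String() uses it unchanged.
	u.Path = strings.TrimSuffix(b.base.Path, "/") + "/" + strings.Join(segments, "/")
	u.RawPath = strings.TrimSuffix(b.base.EscapedPath(), "/") + "/" + strings.Join(escaped, "/")
	return u.String()
}

func main() {
	b, err := newURLBuilder("https://hub.example.invalid")
	if err != nil {
		panic(err)
	}
	fmt.Println(b.fileURL("org", "model", "resolve", "main", "weights #1 100%.bin"))
}
```

With fmt.Sprintf-style interpolation the '#' in that filename would start a URL fragment and the bare '%' would be an invalid escape; encoding each segment separately avoids both.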
* fix: address PR #2974 review feedback

- Fix layer key collision: layerKey/lockedLayerKey now use path+digest (via DirhashPart.String()) instead of digest-only, preventing wrong layer pairing when files have identical content but different paths.
- Unify config normalization: drift checker now goes through WeightSpecFromConfig instead of a parallel sortedCopy path, fixing false drift from whitespace-padded include/exclude patterns.
- Fix --json exit code: 'cog weights status --json' now returns exit 1 when weights aren't ready, matching text mode behavior.
- Verify weights before image push: BundlePusher.Push now HEAD-checks weight manifests before pushing the image, failing fast without leaving orphaned images in the registry.
- Implement HF pagination: listTree follows cursor-based Link headers so large HuggingFace repos return complete file listings.
- Require source in schema: weight entries now require source.uri per the current design (source was always required in practice).
- Simplify: extract fileSetKey and WeightSpec.ConfigWeight() to eliminate near-duplicate code and fragile field-by-field copies.

* fix(weights): harden lockfile, push verification, and source ingest

Address findings from a follow-up review pass over the managed-weights pipeline. The themes are:

- Lockfile durability: atomic write via tempfile+fsync+rename so a killed import can't leave a half-written lockfile that would block push/predict/train. Cross-process flock around the import critical section so concurrent imports serialize instead of last-writer-wins. Reject duplicate names/targets on parse so hand-edited or merged lockfiles can't yield non-deterministic Find/Pull/Prepare behavior.
- Push verification by digest: BundlePusher now HEADs repo@digest rather than the human tag, and cross-checks the returned digest. Tags are mutable; a registry (or anyone with push access) could otherwise substitute a different manifest at the recorded tag between import and push.
- Source ingest hardening: reject non-regular entries (symlinks, devices, FIFOs, sockets) per spec §1.3 instead of silently skipping them, since silent skip ships a model missing files the user expected. Bound HuggingFace inline-file fetches with io.LimitReader + size equality so a misconfigured mirror can't stream gigabytes behind a 1 KB metadata claim.
- Smaller correctness items: propagate I/O errors when draining a shared reader on already-stored digests; reject target == "/" in config validation; deterministic orphan iteration in weights_status (slice, not map); 64-bit invocation IDs in Mounts; defensive empty Plan.Files guard in manifest assembly; debug-log fast-path recompute so envelope/packer drift is diagnosable; log unexpected non-regular tar entries on pull; promote gofrs/flock to a direct dep.
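Two of the hardening techniques named in this commit, the tempfile+fsync+rename lockfile write and the io.LimitReader + size-equality bound on fetches, can be sketched with the standard library alone. This is an illustration under assumptions, not the code in pkg/weights: writeFileAtomic, readExactly, and run are hypothetical names, and the cross-process flock is omitted.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeFileAtomic (hypothetical name) writes data to a temp file next to
// path, fsyncs it, then renames it into place, so readers see either the
// old file or the complete new one.
func writeFileAtomic(path string, data []byte, perm os.FileMode) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil { // on any failure, leave no temp file behind
			tmp.Close()
			os.Remove(tmp.Name())
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Chmod(perm); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil { // flush contents before the rename
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// readExactly (hypothetical name) reads a body whose metadata claims want
// bytes. It reads at most want+1 bytes, so an oversized body is detected
// without streaming the rest of it.
func readExactly(r io.Reader, want int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, want+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) != want {
		return nil, fmt.Errorf("size mismatch: expected %d bytes, read %d before stopping", want, len(data))
	}
	return data, nil
}

// run exercises both helpers and returns an error if either misbehaves.
func run() error {
	dir, err := os.MkdirTemp("", "lock-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	lock := filepath.Join(dir, "weights.lock")
	for _, body := range []string{"v1", "v2"} {
		if err := writeFileAtomic(lock, []byte(body), 0o644); err != nil {
			return err
		}
	}
	got, err := os.ReadFile(lock)
	if err != nil {
		return err
	}
	if string(got) != "v2" {
		return fmt.Errorf("lockfile holds %q, want %q", got, "v2")
	}
	if _, err := readExactly(strings.NewReader("abcd"), 4); err != nil {
		return err
	}
	if _, err := readExactly(strings.NewReader("abcdefgh"), 4); err == nil {
		return fmt.Errorf("oversized body was accepted")
	}
	if _, err := readExactly(strings.NewReader("ab"), 4); err == nil {
		return fmt.Errorf("truncated body was accepted")
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		panic(err)
	}
	fmt.Println("ok")
}
```

Creating the temp file in the destination directory keeps the rename on one filesystem. A stricter variant would also fsync the parent directory after the rename; whether the commit does so is not stated here.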