feat: experimental managed weights (#2974)
* feat(weights): managed weights pipeline (config, OCI bundles, store, CLI)
Adds the v1 managed-weights feature:
- pkg/config: `weights` stanza in cog.yaml + JSON schema validation
- pkg/model/weightsource: pluggable Source interface with file:// and hf://
scheme implementations, include/exclude filters, content-addressed
fingerprinting
- pkg/model: weight manifest v1 format, OCI bundle index assembly,
packer plan/execute split, weight pusher with progress reporting
- pkg/weights: WeightStore interface, FileStore (hardlinked extraction
cache), WeightManager, parallel layer puller, mount helper for predict
- pkg/weights/lockfile: weights.lock parsing, generation, source fingerprint
- pkg/paths: cache-dir resolver
- pkg/cli/weights: hidden `cog weights` command group with import, pull,
status subcommands
- pkg/cli/predict: auto-pull and mount managed weights at predict time
- pkg/cli/{build,push}: emit weight manifests as part of the OCI bundle
- pkg/predict: wire WeightManager into the predictor
Tooling additions:
- go.mod: hashicorp/go-retryablehttp for resilient pulls
- mise.toml: goimports + rust-analyzer
- .gitignore: drop legacy weights-gen ignore lines
The `cog weights` command group is hidden, and the bundle format only
activates when cog.yaml declares a `weights` stanza, so existing models
are unaffected.
* test(weights): integration tests and managed-weights example
- integration-tests/tests/weights_*.txtar: end-to-end coverage for
cog weights import, pull, predict, and include/exclude filters
- integration-tests/tests/oci_bundle_*.txtar: bundle build/push/inspect
flows updated for the v1 weight manifest format
- integration-tests/harness: helpers for the new tests
- examples/managed-weights: working reference model with cog.yaml,
predict.py, README, and a checked-in weights.lock for verifying the
pull/predict round trip
* docs(spec): add internal draft spec for managed weights
specs/draft-weights.md is an internal design document for the OCI
weight bundle format. It lives under specs/ rather than docs/ so it is
not part of the published documentation site. The `draft-` prefix
signals that it is not yet a stable reference.
* refactor(weights): simplify review cleanup
- Parallelize computeLayerDigests and computeInventory file hashing
with errgroup (bounded by GOMAXPROCS)
- Type WeightStatus/LayerStatus as string enums for compile-time safety
- Deduplicate formatDigestShort by delegating to model.ShortDigest
- Remove HasProblems() wrapper, inline !AllReady() at call site
* refactor(weights): remove cog inspect command
The hidden cog inspect command was added during early managed-weights
development to debug bundle pushes. It was never designed for general
release: no docs, no tests beyond a single integration script, and the
output format (text and JSON variants, with a separate raw streaming
mode) accreted features without a coherent target audience.
Removing it lets us also drop pkg/model/index.go (the Index/IndexManifest
parse types only inspect consumed) and a small block in resolver.go that
populated those types when loading bundles. The push-side IndexBuilder
in index_factory.go is unrelated and stays.
PlatformUnknown moves next to Platform in artifact_image.go, where it
belongs structurally. TestModel_IsBundle moves to model_test.go since
index_test.go went away.
* test(weights): drop integration tests for removed commands
Three integration tests asserted on the v0 cog weights {build,push,inspect}
verbs and v0 lockfile schema (dest, digestOriginal fields). Those commands
and that schema are gone in the current managed-weights design; the tests
have been failing in CI on this branch since the v1 rewrite landed.
- weights_build.txtar: tested cog weights build standalone. v1 has no
separate build step (the lockfile is generated as part of import); the
schema fields it asserted on no longer exist.
- weights_push_inspect.txtar: tested the build → push → inspect lifecycle
end-to-end. v1 collapses build+push into cog weights import (covered by
oci_bundle_push.txtar) and replaces inspect with status (covered by
weights_status unit tests).
- oci_bundle_build.txtar: tested cog predict against a pre-built bundled
image. predict.go:193-195 documents that pre-built images are now
opaque to Cog at predict time — that path is explicitly out of scope
for v1.
One real coverage gap surfaces: parseRepoOnly's rejection of tagged
references in cog weights import is no longer asserted anywhere. Worth
adding a small test for it later, but unrelated to this cleanup.
* test(weights): remove unused mock-weights testscript helper
cmdMockWeights generated N random files plus a synthetic weights.lock,
and was registered as the mock-weights testscript command. Originally
useful when weights were single random files, but the v1 design treats
weights more like Hugging Face repos: one or two large shards alongside
several small JSON/text files. A flat 'N random files of size X'
fixture no longer represents the shape we want to test against, and no
existing .txtar invokes mock-weights — every weights-related test seeds
files inline with dd/printf/echo.
If we later need realistic fixtures, the right shape is a templated
hf-style directory with configurable padding for the large files, not
a refactor of this helper. Dropping the dead code now to keep the
harness honest about what it actually provides.
* style: gofmt fixes
Apply gofmt across files that drifted out of canonical formatting.
Mechanical changes only (whitespace, struct tag spacing, table-driven
test alignment). No behavior change. CI's Format Go job was rejecting
these on the branch.
* test(weights): repair integration tests broken by v1 design changes
Five integration tests were failing in CI for substantive reasons (not
just the dead commands cleaned up earlier). Each one tripped on a
specific v1 design choice:
- weights_import_predict, weights_pull_predict, weights_filter: the
cog.yaml templates carried a placeholder image: line, with the test
also appending the real image: via printf at runtime. Two image:
keys break YAML parsing. Drop the placeholder; the printf writes
the only one.
- weights_pull, weights_pull_predict: assumed cog weights import
leaves the local cache cold so the subsequent cog weights pull has
work to do. v1's import warms the cache as a side effect (the
cog-i12u guarantee), so pull becomes a no-op. Purge the cache
between import and pull to simulate the realistic 'lockfile checked
in, fresh clone, cold cache' scenario these tests want to cover.
- oci_bundle_push: three issues. (1) source: weights-alpha used the v0
string schema; v1 requires source.uri with a file:// URI. (2) cog
push no longer runs the weight builder implicitly — it requires
weights.lock to already exist, so add an explicit cog weights import
step. (3) the run.cog.reference.type annotation assertion was for an
annotation v1 doesn't emit; replace with run.cog.weight.set-digest,
which v1 does emit on each weight descriptor.
All five verified passing locally with the just-built dist binary
against a real local registry. None of these failures were introduced
by recent slices; they have been failing on the branch since the v1
rewrite landed in 07db2579.
* fix: address security and correctness issues from PR review
- FileSource.Open: use os.DirFS to prevent path traversal at the FS
boundary instead of relying on upstream validation
- FileStore.PutFile: validate expectedSize against bytes written to
detect truncation/corruption
- HFSource: build URLs via path.Join + url.URL instead of fmt.Sprintf
string interpolation; parse baseURL once at construction
- writeLayer: check context cancellation before directory header writes
* fix: prune orphaned entries from weights.lock when config changes
Removing a weight from cog.yaml left its entry in weights.lock.
Since the lockfile is projected into /.cog/weights.json (the runtime
manifest), coglet expected weights that no longer exist.
Add Retain() and PruneLockfile() to the lockfile package. Prune runs
during cog weights import (always, using all config names) and in
Resolver.Build() before the image build so writeRuntimeWeightsManifest
sees a clean lockfile.
Also adds a config.WeightNames() helper to avoid duplicating the
name-extraction loop across call sites.
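The pruning step described above can be sketched roughly as follows. The
Entry and Lockfile types and the Retain signature are simplified
assumptions for illustration. The real lockfile package may shape them
differently.

```go
package main

import "fmt"

// Entry and Lockfile are hypothetical, trimmed-down stand-ins for the
// real lockfile types.
type Entry struct {
	Name   string
	Digest string
}

type Lockfile struct {
	Weights []Entry
}

// Retain keeps only the entries whose Name appears in names, preserving
// their original order, and returns how many entries were dropped.
func (l *Lockfile) Retain(names []string) int {
	keep := make(map[string]struct{}, len(names))
	for _, n := range names {
		keep[n] = struct{}{}
	}

	before := len(l.Weights)
	kept := l.Weights[:0] // filter in place, reusing the backing array
	for _, e := range l.Weights {
		if _, ok := keep[e.Name]; ok {
			kept = append(kept, e)
		}
	}
	l.Weights = kept
	return before - len(kept)
}

func main() {
	lf := &Lockfile{Weights: []Entry{
		{Name: "resnet", Digest: "sha256:aaa"},
		{Name: "qwen", Digest: "sha256:bbb"},
	}}
	// Only "resnet" is still declared in the config, so "qwen" is pruned.
	removed := lf.Retain([]string{"resnet"})
	fmt.Println(removed, len(lf.Weights))
}
```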
* chore(examples): remove orphaned qwen weight from managed-weights example
The qwen3.6-27b-fp8 weight was commented out in cog.yaml but its
entry (~30GB of layer metadata) persisted in weights.lock. This was
the bug that motivated the pruning fix. Clean up the example to
reflect the actual config.
* feat: detect weights.lock drift and fail build/push/predict/train early
build, push, predict, and train now compare cog.yaml weight declarations
against weights.lock before invoking resolver.Build(). If the lockfile is
stale, missing, or has orphaned entries, the command fails immediately with
a clear message listing each mismatch and directing the user to run
'cog weights import'.
The drift detection is split into two layers:
- lockfile.CheckDrift: pure comparison, no I/O, lives in pkg/weights/lockfile
- weights.CheckDrift: loads lockfile, normalizes config, formats errors
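A minimal sketch of what the pure comparison layer might look like. The
Spec and Locked types, the fields compared, and the message wording are
assumptions made for illustration, not the actual lockfile.CheckDrift API.

```go
package main

import (
	"fmt"
	"sort"
)

// Spec is a weight as declared in cog.yaml; Locked is a weights.lock
// entry. Both are hypothetical, reduced to the fields compared here.
type Spec struct{ Name, URI string }
type Locked struct{ Name, URI string }

// checkDrift performs no I/O: it compares the two views and returns a
// sorted list of mismatches. An empty result means the lockfile is in sync.
func checkDrift(config []Spec, locked []Locked) []string {
	byName := make(map[string]Locked, len(locked))
	for _, l := range locked {
		byName[l.Name] = l
	}

	var out []string
	declared := make(map[string]bool, len(config))
	for _, s := range config {
		declared[s.Name] = true
		l, ok := byName[s.Name]
		switch {
		case !ok:
			out = append(out, s.Name+": missing from weights.lock")
		case l.URI != s.URI:
			out = append(out, s.Name+": source changed since last import")
		}
	}
	for _, l := range locked {
		if !declared[l.Name] {
			out = append(out, l.Name+": orphaned entry in weights.lock")
		}
	}
	sort.Strings(out) // deterministic output regardless of input order
	return out
}

func main() {
	config := []Spec{{"a", "file://a"}, {"b", "file://b-new"}}
	locked := []Locked{{"a", "file://a"}, {"b", "file://b"}, {"c", "file://c"}}
	for _, m := range checkDrift(config, locked) {
		fmt.Println(m)
	}
}
```

Keeping this layer free of I/O means it can be unit-tested with plain
values, while the outer layer owns file loading and error formatting.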
* feat: make Resolver.Build read-only, HEAD-check weights during push
Resolver.Build() no longer opens the weight store, ingresses files, or
writes weights.lock. It loads lockfile entries into a new model.Weight
type -- the model's lightweight representation of a managed weight.
BundlePusher.Push() no longer uploads weight layers. It HEAD-checks
each weight manifest by tag (pushed earlier by 'cog weights import'),
gets the descriptor, and assembles the OCI index. Image and weight
HEAD-checks run concurrently.
WeightArtifact/WeightBuilder/WeightPusher remain for the import path.
Also removes dead fields: PushOptions.WeightProgressFn,
BuildOptions.WeightsLockPath.
* chore(examples): add directory listing to predict.py setup for debugging
* fix: URL-escape path segments in HuggingFace URL builder
buildURLWithQuery now percent-encodes each path component individually
so filenames containing #, %, spaces, or other URL-special characters
produce valid URLs. Uses url.URL.RawPath to avoid double-encoding.
Resolves ask-bonk review feedback on PR #2974.
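The per-segment escaping approach can be sketched as below.
buildEscapedURL and the example.test host are hypothetical, not the
actual buildURLWithQuery. The point is that setting both Path (decoded)
and RawPath (encoded) lets url.URL emit the escaped form once, without
double-encoding.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildEscapedURL (hypothetical) appends segments to base, percent-encoding
// each segment on its own so that characters such as '#', '%', ' ' and
// even '/' inside a single segment survive as data.
func buildEscapedURL(base string, segments ...string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}

	escaped := make([]string, len(segments))
	for i, s := range segments {
		escaped[i] = url.PathEscape(s)
	}

	basePath := strings.TrimSuffix(u.Path, "/")
	baseRaw := strings.TrimSuffix(u.EscapedPath(), "/")

	// Path holds the decoded form and RawPath the encoded form. String()
	// uses RawPath verbatim as long as it decodes back to Path.
	u.Path = basePath + "/" + strings.Join(segments, "/")
	u.RawPath = baseRaw + "/" + strings.Join(escaped, "/")
	return u.String(), nil
}

func main() {
	s, err := buildEscapedURL("https://example.test", "models", "a #1%.bin")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```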
* chore: fix formatting
* fix: remove stale 'Pushing weights' assertion from oci_bundle_push test
Weights are pushed exclusively via 'cog weights import', not during
'cog push'. The push flow only HEAD-checks that weight manifests exist.
* feat(examples): add resnet50 example with HuggingFace managed weights
* fix: replace removed util.SHA256HashFile with local weightsource helper
The deadcode cleanup on main deleted pkg/util/hash.go, breaking
setdigest.go which was the only consumer. Move the SHA-256 file
hashing into weightsource as a package-private sha256File() that
returns the "sha256:<hex>" digest directly. Also fix ruff lint
errors in the resnet example (import order, return annotation, zip
strict).
* fix: address PR #2974 review feedback
- Fix layer key collision: layerKey/lockedLayerKey now use path+digest
(via DirhashPart.String()) instead of digest-only, preventing wrong
layer pairing when files have identical content but different paths.
- Unify config normalization: drift checker now goes through
WeightSpecFromConfig instead of a parallel sortedCopy path, fixing
false drift from whitespace-padded include/exclude patterns.
- Fix --json exit code: 'cog weights status --json' now returns exit 1
when weights aren't ready, matching text mode behavior.
- Verify weights before image push: BundlePusher.Push now HEAD-checks
weight manifests before pushing the image, failing fast without
leaving orphaned images in the registry.
- Implement HF pagination: listTree follows cursor-based Link headers
so large HuggingFace repos return complete file listings.
- Require source in schema: weight entries now require source.uri per
the current design (source was always required in practice).
- Simplify: extract fileSetKey and WeightSpec.ConfigWeight() to
eliminate near-duplicate code and fragile field-by-field copies.
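To illustrate the layer key collision fixed above, here is a small sketch
comparing a digest-only key with a path+digest key. The part type and the
NUL separator are assumptions for illustration. The source says only that
the real key comes from DirhashPart.String().

```go
package main

import "fmt"

// part is a hypothetical stand-in for a file recorded in a layer.
type part struct {
	Path   string
	Digest string
}

// digestKey is the old, collision-prone key: identical content at two
// paths maps to the same key.
func digestKey(p part) string { return p.Digest }

// pathDigestKey combines path and digest. The NUL separator is an
// assumption here, chosen because it cannot appear in a file path.
func pathDigestKey(p part) string { return p.Path + "\x00" + p.Digest }

// indexBy builds a lookup table; later parts overwrite earlier ones that
// share a key, which is exactly how the wrong pairing arises.
func indexBy(parts []part, key func(part) string) map[string]part {
	m := make(map[string]part, len(parts))
	for _, p := range parts {
		m[key(p)] = p
	}
	return m
}

func main() {
	// Two files with identical content but different paths.
	parts := []part{
		{Path: "tokenizer/vocab.json", Digest: "sha256:abc"},
		{Path: "backup/vocab.json", Digest: "sha256:abc"},
	}
	fmt.Println(len(indexBy(parts, digestKey)), len(indexBy(parts, pathDigestKey)))
}
```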
* fix(weights): harden lockfile, push verification, and source ingest
Address findings from a follow-up review pass over the managed-weights
pipeline. The themes are:
- Lockfile durability: atomic write via tempfile+fsync+rename so a
killed import can't leave a half-written lockfile that would block
push/predict/train. Cross-process flock around the import critical
section so concurrent imports serialize instead of last-writer-wins.
Reject duplicate names/targets on parse so hand-edited or merged
lockfiles can't yield non-deterministic Find/Pull/Prepare behavior.
- Push verification by digest: BundlePusher now HEADs repo@digest
rather than the human tag, and cross-checks the returned digest.
Tags are mutable; a registry (or anyone with push access) could
otherwise substitute a different manifest at the recorded tag
between import and push.
- Source ingest hardening: reject non-regular entries (symlinks,
devices, FIFOs, sockets) per spec §1.3 instead of silently skipping
them, since silent skip ships a model missing files the user
expected. Bound HuggingFace inline-file fetches with io.LimitReader
+ size equality so a misconfigured mirror can't stream gigabytes
behind a 1 KB metadata claim.
- Smaller correctness items: propagate I/O errors when draining a
shared reader on already-stored digests; reject target == "/" in
config validation; deterministic orphan iteration in weights_status
(slice, not map); 64-bit invocation IDs in Mounts; defensive empty
Plan.Files guard in manifest assembly; debug-log fast-path
recompute so envelope/packer drift is diagnosable; log unexpected
non-regular tar entries on pull; promote gofrs/flock to a direct
dep.