uv
567468ce - More efficient cache-key globbing + support parent paths in globs (#13469)

Commit
154 days ago
More efficient cache-key globbing + support parent paths in globs (#13469) ## Summary (Related PR: #13438 - would be nice to have it merged as well since it touches on the same globwalker code) There's a few issues with `cache-key` globs, which this PR attempts to address: - As of the current state, parent or absolute paths are not allowed, which is not obvious and is not documented. E.g., cache-key paths of the form `{file = "../dep/**"}` will be essentially ignored. - Absolute glob patterns also don't work (funnily enough, there's logic in `globwalk` itself that attempts to address it in [`globwalk::glob_builder()`](https://github.com/Gilnaa/globwalk/blob/8973fa2bc560be54c91448131238fa50d56ee121/src/lib.rs#L415), which serves as inspiration to some parts of this PR). - The reason for parent paths being ignored is the way globwalker is currently being triggered in `uv-cache-info`: the base directory is being walked over completely and each entry is then being matched to one of the provided match patterns. - This may also end up being very inefficient if you have a huge root folder with thousands of files: if your match patterns are `a/b/*.rs` and `a/c/*.py` then instead of walking over the root directory, you can just walk over `a/b` and `a/c` and match the relevant patterns there. - Why supporting parent paths may be important to the point of being a blocker: in large codebases with python projects depending on other local non-python projects (e.g. rust crates), cache-keys can be very useful to track dependency on the source code of the latter (e.g. `cache-keys = [{ file = "../../crates/some-dep/**" }]`. - TLDR: parent/absolute cache-key globs don't work, glob walk can be slow. ## Solution - In this PR, user-provided glob patterns are first clustered (LCP-style) into pattern groups with longest common path prefix; each of these groups can then be walked over separately. - Pattern groups do not overlap, so we would never walk over the same directory twice (unless there's symlinks pointing to same folders). - Paths are not canonicalized nor virtually normalized (which is impossible on Unix without FS access), so the method is symlink-safe (i.e. we don't treat `a/b/..` as `a`) and should work fine with #13438. - Because of LCP logic, the minimal amount of directory space will be traversed to cover all patterns. - Absolute glob patterns will now work. - Parent-relative glob patterns will now work. - Glob walking will be more efficient in some cases. ## Possible improvements - Efficiency can be further greatly improved if we limit max depth for globwalk. Currently, a simple ".toml" will deep-traverse the whole folder. Essentially, max depth can be always set to either N or infinity. If a pattern at a pivot node contains `**`, we collect all children nodes from the subtree into the same group and don't limit max depth; otherwise, we set max depth to the length of the glob pattern. This wouldn't change correctness though and can we done separately if needed. - If this is considered important enough, docs can be updated to indicate that parent and absolute globs are supported (and symlinks are resolved, if the relevant PR is algo merged in). ## Test Plan - Glob splitting and clustering tests are included in the PR. - Relative and absolute glob cache-keys were tested in an actual codebase.
Author
Parents
Loading