git/t/t6601-path-walk.sh
Derrick Stolee bff4555767 backfill: add --sparse option
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, but the sparse-checkout
also reduces the number of objects that must be downloaded from a promisor
remote.

However, history investigations can be expensive as computing blob diffs
will trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time when the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and non-cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.
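
As a rough illustration of the two kinds of pattern lists, the sketch below writes one of each to a file. The cone-mode list is the same three lines that the 'base & topic, sparse' test in this file feeds to test-tool; the non-cone list and both file names are hypothetical and only meant to show a positive and a negative pattern side by side.

```shell
#!/bin/sh
# Cone mode: the pattern list for a sparse-checkout containing only the
# top-level "left" directory (the same three lines used by the
# 'base & topic, sparse' test).
printf '%s\n' '/*' '!/*/' '/left/' >cone-patterns

# Non-cone mode: a hypothetical list mixing a positive and a negative
# pattern, intended to keep everything except the "right" directory.
printf '%s\n' '/*' '!/right/' >non-cone-patterns

# Inside Git's test suite, either file could then be passed on stdin:
#   test-tool path-walk --stdin-pl -- base topic <cone-patterns
```

The file names 'cone-patterns' and 'non-cone-patterns' are placeholders; only the cone-mode contents are taken from the test script.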

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are:

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse
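
The three steps above could be wrapped in a small helper, sketched here under assumptions: the function name 'backfill_sparse_clone' and its argument order are hypothetical, and 'git backfill' must be available in the installed Git.

```shell
#!/bin/sh
# Sketch only: "backfill_sparse_clone" is a hypothetical helper name.
# Usage: backfill_sparse_clone <url> <target-dir> <dir1> [<dirN>...]
backfill_sparse_clone () {
	url=$1 target=$2 &&
	shift 2 &&
	# 1. Blobless partial clone with a minimal sparse-checkout.
	git clone --filter=blob:none --sparse "$url" "$target" &&
	# 2. Expand the sparse-checkout to the directories of interest.
	git -C "$target" sparse-checkout set "$@" &&
	# 3. Download the historical blobs inside the sparse-checkout.
	git -C "$target" backfill --sparse
}
```

A call such as 'backfill_sparse_clone <url> git builtin' would correspond to the 'builtin' measurement described in this message, with '<url>' left for the reader to fill in.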

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.
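
To make the comparison concrete, the arithmetic behind the two tables can be worked through as below. The figures are copied from the tables above; treating "over 6 GB" as exactly 6 * 1024 MB is an assumption, so the Linux percentage is only an approximate upper estimate, and integer division truncates both percentages.

```shell
#!/bin/sh
# Pack sizes in MB, copied from the tables above.
git_initial=110 git_backfilled=192 git_full=277
linux_initial=1876 linux_backfilled=2194

# Assumption: "over 6 GB" is taken as 6 * 1024 MB. The real full clone
# is larger, so the true fraction is somewhat smaller than computed here.
linux_full=$((6 * 1024))

# Extra data fetched by the backfill, and the share of a full clone that
# the backfilled sparse clone occupies (integer division truncates).
echo "git:   +$((git_backfilled - git_initial)) MB," \
	"roughly $((git_backfilled * 100 / git_full))% of a full clone"
echo "linux: +$((linux_backfilled - linux_initial)) MB," \
	"roughly $((linux_backfilled * 100 / linux_full))% of a full clone"
```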

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions of each file exist within the sparse-checkout
scope. There will not be a one-size-fits-all solution.
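
One way to experiment with batch sizes is sketched below, assuming the '--min-batch-size=<n>' option of 'git backfill' is the knob being varied; whether the tables above were produced exactly this way is not stated here. The helper name 'try_batch_size' is hypothetical.

```shell
#!/bin/sh
# Sketch only: "try_batch_size" is a hypothetical helper name.
# Usage: try_batch_size <repo-dir> <min-batch-size>
try_batch_size () {
	repo=$1 batch=$2 &&
	# Request missing blobs in batches of at least $batch objects,
	# restricted to the sparse-checkout.
	git -C "$repo" backfill --min-batch-size="$batch" --sparse &&
	# Report the pack count ("packs") and pack size ("size-pack").
	git -C "$repo" count-objects -v
}
```

For a fair comparison, each batch size would need its own fresh copy of the same blobless sparse clone, since a second run on the same repository has little left to download.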

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03 16:12:42 -08:00

#!/bin/sh
TEST_PASSES_SANITIZE_LEAK=true
test_description='direct path-walk API tests'
. ./test-lib.sh
test_expect_success 'setup test repository' '
git checkout -b base &&
# Make some objects that will only be reachable
# via non-commit tags.
mkdir child &&
echo file >child/file &&
git add child &&
git commit -m "will abandon" &&
git tag -a -m "tree" tree-tag HEAD^{tree} &&
echo file2 >file2 &&
git add file2 &&
git commit --amend -m "will abandon" &&
git tag tree-tag2 HEAD^{tree} &&
echo blob >file &&
blob_oid=$(git hash-object -t blob -w --stdin <file) &&
git tag -a -m "blob" blob-tag "$blob_oid" &&
echo blob2 >file2 &&
blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
git tag blob-tag2 "$blob2_oid" &&
rm -fr child file file2 &&
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
git commit --amend -m "first" &&
git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
git tag -a -m "second (under)" second.1 HEAD &&
git tag -a -m "second (top)" second.2 second.1 &&
# Set up file/dir collision in history.
rm a &&
mkdir a &&
echo a >a/a &&
echo bb >left/b &&
git add a left &&
git commit -m "third" &&
git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
git commit -a -m "topic" &&
git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base)
0:commit::$(git rev-parse base~1)
0:commit::$(git rev-parse base~2)
1:tag:/tags:$(git rev-parse refs/tags/first)
1:tag:/tags:$(git rev-parse refs/tags/second.1)
1:tag:/tags:$(git rev-parse refs/tags/second.2)
1:tag:/tags:$(git rev-parse refs/tags/third)
1:tag:/tags:$(git rev-parse refs/tags/fourth)
1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
3:tree::$(git rev-parse topic^{tree})
3:tree::$(git rev-parse base^{tree})
3:tree::$(git rev-parse base~1^{tree})
3:tree::$(git rev-parse base~2^{tree})
3:tree::$(git rev-parse refs/tags/tree-tag^{})
3:tree::$(git rev-parse refs/tags/tree-tag2^{})
4:blob:a:$(git rev-parse base~2:a)
5:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
6:tree:a/:$(git rev-parse base:a)
7:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
8:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
9:tree:left/:$(git rev-parse base:left)
9:tree:left/:$(git rev-parse base~2:left)
10:blob:left/b:$(git rev-parse base~2:left/b)
10:blob:left/b:$(git rev-parse base:left/b)
11:tree:right/:$(git rev-parse topic:right)
11:tree:right/:$(git rev-parse base~1:right)
11:tree:right/:$(git rev-parse base~2:right)
12:blob:right/c:$(git rev-parse base~2:right/c)
12:blob:right/c:$(git rev-parse topic:right/c)
13:blob:right/d:$(git rev-parse base~1:right/d)
blobs:10
commits:4
tags:7
trees:13
EOF
test_cmp_sorted expect out
'
test_expect_success 'indexed objects' '
test_when_finished git reset --hard &&
# stage change into index, adding a blob but
# also invalidating the cache-tree for the root
# and the "left" directory.
echo bogus >left/c &&
git add left &&
test-tool path-walk -- --indexed-objects >out &&
cat >expect <<-EOF &&
0:blob:a:$(git rev-parse HEAD:a)
1:blob:left/b:$(git rev-parse HEAD:left/b)
2:blob:left/c:$(git rev-parse :left/c)
3:blob:right/c:$(git rev-parse HEAD:right/c)
4:blob:right/d:$(git rev-parse HEAD:right/d)
5:tree:right/:$(git rev-parse topic:right)
blobs:5
commits:0
tags:0
trees:1
EOF
test_cmp_sorted expect out
'
test_expect_success 'branches and indexed objects mix well' '
test_when_finished git reset --hard &&
# stage change into index, adding a blob but
# also invalidating the cache-tree for the root
# and the "right" directory.
echo fake >right/d &&
git add right &&
test-tool path-walk -- --indexed-objects --branches >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base)
0:commit::$(git rev-parse base~1)
0:commit::$(git rev-parse base~2)
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
2:tree:a/:$(git rev-parse refs/tags/third:a)
3:tree:left/:$(git rev-parse base:left)
3:tree:left/:$(git rev-parse base~2:left)
4:blob:left/b:$(git rev-parse base:left/b)
4:blob:left/b:$(git rev-parse base~2:left/b)
5:tree:right/:$(git rev-parse topic:right)
5:tree:right/:$(git rev-parse base~1:right)
5:tree:right/:$(git rev-parse base~2:right)
6:blob:right/c:$(git rev-parse base~2:right/c)
6:blob:right/c:$(git rev-parse topic:right/c)
7:blob:right/d:$(git rev-parse base~1:right/d)
7:blob:right/d:$(git rev-parse :right/d)
8:blob:a:$(git rev-parse base~2:a)
blobs:7
commits:4
tags:0
trees:10
EOF
test_cmp_sorted expect out
'
test_expect_success 'base & topic, sparse' '
cat >patterns <<-EOF &&
/*
!/*/
/left/
EOF
test-tool path-walk --stdin-pl -- base topic <patterns >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base)
0:commit::$(git rev-parse base~1)
0:commit::$(git rev-parse base~2)
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
2:blob:a:$(git rev-parse base~2:a)
3:tree:left/:$(git rev-parse base:left)
3:tree:left/:$(git rev-parse base~2:left)
4:blob:left/b:$(git rev-parse base~2:left/b)
4:blob:left/b:$(git rev-parse base:left/b)
blobs:3
commits:4
tags:0
trees:6
EOF
test_cmp_sorted expect out
'
test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base~1)
0:commit::$(git rev-parse base~2)
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
2:blob:a:$(git rev-parse base~2:a)
3:tree:left/:$(git rev-parse base~2:left)
4:blob:left/b:$(git rev-parse base~2:left/b)
5:tree:right/:$(git rev-parse topic:right)
5:tree:right/:$(git rev-parse base~1:right)
5:tree:right/:$(git rev-parse base~2:right)
6:blob:right/c:$(git rev-parse base~2:right/c)
6:blob:right/c:$(git rev-parse topic:right/c)
7:blob:right/d:$(git rev-parse base~1:right/d)
blobs:5
commits:3
tags:0
trees:7
EOF
test_cmp_sorted expect out
'
test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
1:tree::$(git rev-parse topic^{tree})
2:blob:a:$(git rev-parse topic:a):UNINTERESTING
3:tree:left/:$(git rev-parse topic:left):UNINTERESTING
4:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
5:tree:right/:$(git rev-parse topic:right)
6:blob:right/c:$(git rev-parse topic:right/c)
7:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
blobs:4
commits:1
tags:0
trees:3
EOF
test_cmp_sorted expect out
'
test_expect_success 'fourth, blob-tag2, not base' '
test-tool path-walk -- fourth blob-tag2 --not base >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
1:tag:/tags:$(git rev-parse fourth)
2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
3:tree::$(git rev-parse topic^{tree})
4:blob:a:$(git rev-parse base~1:a):UNINTERESTING
5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
7:tree:right/:$(git rev-parse topic:right)
8:blob:right/c:$(git rev-parse topic:right/c)
9:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
commits:1
tags:1
trees:3
EOF
test_cmp_sorted expect out
'
test_expect_success 'topic, not base, only blobs' '
test-tool path-walk --no-trees --no-commits \
-- topic --not base >out &&
cat >expect <<-EOF &&
0:blob:a:$(git rev-parse topic:a):UNINTERESTING
1:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
2:blob:right/c:$(git rev-parse topic:right/c)
3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
blobs:4
commits:0
tags:0
trees:0
EOF
test_cmp_sorted expect out
'
# No, this doesn't make a lot of sense for the path-walk API,
# but it is possible to do.
test_expect_success 'topic, not base, only commits' '
test-tool path-walk --no-blobs --no-trees \
-- topic --not base >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
commits:1
blobs:0
tags:0
trees:0
EOF
test_cmp_sorted expect out
'
test_expect_success 'topic, not base, only trees' '
test-tool path-walk --no-blobs --no-commits \
-- topic --not base >out &&
cat >expect <<-EOF &&
0:tree::$(git rev-parse topic^{tree})
1:tree:left/:$(git rev-parse topic:left):UNINTERESTING
2:tree:right/:$(git rev-parse topic:right)
commits:0
blobs:0
tags:0
trees:3
EOF
test_cmp_sorted expect out
'
test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base~1):UNINTERESTING
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
2:blob:a:$(git rev-parse base~1:a):UNINTERESTING
3:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
4:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
5:tree:right/:$(git rev-parse topic:right)
5:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
6:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
6:blob:right/c:$(git rev-parse topic:right/c)
7:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
commits:2
tags:0
trees:5
EOF
test_cmp_sorted expect out
'
test_expect_success 'topic, not base, boundary with pruning' '
test-tool path-walk --prune -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base~1):UNINTERESTING
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
2:tree:right/:$(git rev-parse topic:right)
2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
3:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
3:blob:right/c:$(git rev-parse topic:right/c)
blobs:2
commits:2
tags:0
trees:4
EOF
test_cmp_sorted expect out
'
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
(
cd unique-trees &&
mkdir initial &&
test_commit initial/file &&
git switch -c move-to-top &&
git mv initial/file.t ./ &&
test_tick &&
git commit -m moved &&
git update-ref refs/heads/other HEAD
) &&
test-tool -C unique-trees path-walk -- --all >out &&
tree=$(git -C unique-trees rev-parse HEAD:) &&
grep "$tree" out >out-filtered &&
test_line_count = 1 out-filtered
'
test_done