backfill: add --sparse option

One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, but the
sparse-checkout also reduces the number of objects that must be downloaded
from a promisor remote.

However, history investigations can be expensive as computing blob diffs
will trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time when the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.
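
The client-side filtering described above can be pictured with a
deliberately simplified model of a cone-mode sparse-checkout. This is only
a sketch: path_in_cone() and its directory list are hypothetical names, not
Git's implementation, which matches paths against a 'struct pattern_list'.
It assumes the cone consists of the files at the root plus a list of
recursively included directories.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified model of a cone-mode sparse-checkout:
 * files directly at the root are always included, and every path
 * under one of the listed directories is included recursively.
 */
static int path_in_cone(const char *path,
			const char *const *dirs, size_t nr_dirs)
{
	size_t i;

	assert(path);

	/* A path without '/' is a file at the root of the worktree. */
	if (!strchr(path, '/'))
		return 1;

	for (i = 0; i < nr_dirs; i++) {
		size_t len = strlen(dirs[i]);

		/* Require a whole directory name: "d/" but not "dir/". */
		if (!strncmp(path, dirs[i], len) && path[len] == '/')
			return 1;
	}
	return 0;
}
```

With a check like this on the client, blobs at paths outside the cone are
never requested, so the server does not need to evaluate a sparse filter.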

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.
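
A minimal sketch of the ownership pattern this implies, using stand-in
types rather than Git's real 'struct path_walk_info' and
'struct pattern_list'. All names here (walk_info, patterns,
walk_info_enable_patterns(), walk_info_clear()) are hypothetical. The
pointer stays NULL unless a caller opts in, and the clear function releases
it only when it is set.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for Git's 'struct pattern_list'; hypothetical. */
struct patterns {
	char **items;
	size_t nr;
};

/* Stand-in for 'struct path_walk_info' with an optional filter. */
struct walk_info {
	struct patterns *pl;	/* NULL means "no path filtering" */
};

/* Opt in to filtering by attaching a zeroed, empty pattern list. */
static void walk_info_enable_patterns(struct walk_info *info)
{
	assert(info && !info->pl);
	info->pl = calloc(1, sizeof(*info->pl));
	if (!info->pl)
		abort();
}

/* Release the pattern list only if one was attached. */
static void walk_info_clear(struct walk_info *info)
{
	size_t i;

	if (!info->pl)
		return;
	for (i = 0; i < info->pl->nr; i++)
		free(info->pl->items[i]);
	free(info->pl->items);
	free(info->pl);
	info->pl = NULL;	/* sketch-only choice: makes a second clear harmless */
}
```

Callers that never set the pointer pay nothing, which is why existing users
of the walk need no changes.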

Be sure to test this in both cone mode and non-cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.
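
The difference between the two modes can be sketched as a small decision
function. The enums below are local stand-ins, and should_skip_entry() is a
hypothetical helper that restates the check added to the path-walk in this
change. It is not a function that exists in Git.

```c
#include <assert.h>

/* Local stand-ins for Git's match results and object types. */
enum match_result {
	UNDECIDED = -1,
	NOT_MATCHED = 0,
	MATCHED = 1,
	MATCHED_RECURSIVE = 2
};
enum entry_type { ENTRY_TREE, ENTRY_BLOB };

/*
 * Return 1 if the walk should skip this tree entry.
 *
 * Cone mode: any entry that does not match is skipped, trees
 * included, which prunes whole subtrees from the walk.
 *
 * Non-cone mode: trees are always walked, since a pattern may match
 * something deeper; only blobs that are not MATCHED are skipped.
 */
static int should_skip_entry(int use_cone_patterns,
			     enum entry_type type,
			     enum match_result match)
{
	assert(type == ENTRY_TREE || type == ENTRY_BLOB);

	if (use_cone_patterns)
		return match == NOT_MATCHED;
	return type == ENTRY_BLOB && match != MATCHED;
}
```

A negative pattern in non-cone mode can leave a blob as something other
than MATCHED, which is the case the "(negative)" test exercises.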

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are:

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the files within the sparse-checkout
scope. There will not be a one-size-fits-all solution.
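
To see how the batch size relates to the pack counts in the tables, here is
a rough model of path-by-path batching. It is a sketch under stated
assumptions, not the code in the backfill builtin: count_batches() is a
hypothetical helper, and it assumes that all missing blobs found at one
path are added together and that a request is sent once the pending batch
reaches the minimum size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many download requests a path-by-path batching scheme
 * would issue. per_path[i] is the number of missing blobs found at
 * the i-th path. A batch is flushed only once it holds at least
 * min_batch_size objects, so batches may exceed the minimum.
 */
static size_t count_batches(const size_t *per_path, size_t nr_paths,
			    size_t min_batch_size)
{
	size_t i, pending = 0, batches = 0;

	assert(min_batch_size > 0);

	for (i = 0; i < nr_paths; i++) {
		/* All blobs seen at one path join the batch together. */
		pending += per_path[i];
		if (pending >= min_batch_size) {
			batches++;
			pending = 0;
		}
	}
	/* Whatever is left over goes out as one final request. */
	if (pending)
		batches++;
	return batches;
}
```

In this model a larger minimum yields fewer, larger requests, which matches
the trend of falling pack counts in both tables. Fewer requests also means
more work is lost if a single request is interrupted.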

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Author: Derrick Stolee
Date: 2025-02-03 17:11:06 +00:00
Committed by: Junio C Hamano
Parent: 6840fe9ee2
Commit: bff4555767
10 changed files with 208 additions and 15 deletions


@@ -9,7 +9,7 @@ git-backfill - Download missing objects in a partial clone
 SYNOPSIS
 --------
 [synopsis]
-git backfill [--min-batch-size=<n>]
+git backfill [--min-batch-size=<n>] [--[no-]sparse]

 DESCRIPTION
 -----------
@@ -57,6 +57,10 @@ OPTIONS
 	blobs seen at a given path. The default minimum batch size is
 	50,000.

+`--[no-]sparse`::
+	Only download objects if they appear at a path that matches the
+	current sparse-checkout.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].


@@ -56,6 +56,14 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.

+`pl`::
+	This pattern list pointer allows focusing the path-walk search to
+	a set of patterns, only emitting paths that match the given
+	patterns. See linkgit:gitignore[5] or
+	linkgit:git-sparse-checkout[1] for details about pattern lists.
+	When the pattern list uses cone-mode patterns, then the path-walk
+	API can prune the set of paths it walks to improve performance.
+
 Examples
 --------


@@ -4,6 +4,7 @@
 #include "parse-options.h"
 #include "repository.h"
 #include "commit.h"
+#include "dir.h"
 #include "hex.h"
 #include "tree.h"
 #include "tree-walk.h"
@@ -21,7 +22,7 @@
 #include "path-walk.h"

 static const char * const builtin_backfill_usage[] = {
-	N_("git backfill [--min-batch-size=<n>]"),
+	N_("git backfill [--min-batch-size=<n>] [--[no-]sparse]"),
 	NULL
 };

@@ -29,6 +30,7 @@ struct backfill_context {
 	struct repository *repo;
 	struct oid_array current_batch;
 	size_t min_batch_size;
+	int sparse;
 };

 static void backfill_context_clear(struct backfill_context *ctx)
@@ -78,6 +80,14 @@ static int do_backfill(struct backfill_context *ctx)
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	int ret;

+	if (ctx->sparse) {
+		CALLOC_ARRAY(info.pl, 1);
+		if (get_sparse_checkout_patterns(info.pl)) {
+			path_walk_info_clear(&info);
+			return error(_("problem loading sparse-checkout"));
+		}
+	}
+
 	repo_init_revisions(ctx->repo, &revs, "");
 	handle_revision_arg("HEAD", &revs, 0, 0);

@@ -106,10 +116,13 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 		.repo = repo,
 		.current_batch = OID_ARRAY_INIT,
 		.min_batch_size = 50000,
+		.sparse = 0,
 	};
 	struct option options[] = {
 		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
 			    N_("Minimum number of objects to request at a time")),
+		OPT_BOOL(0, "sparse", &ctx.sparse,
+			 N_("Restrict the missing objects to the current sparse-checkout")),
 		OPT_END(),
 	};

dir.c

@@ -1093,10 +1093,6 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }

-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl);
-
 /* Flags for add_patterns() */
 #define PATTERN_NOFOLLOW (1<<0)

@@ -1186,7 +1182,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	return 0;
 }

-static int add_patterns_from_buffer(char *buf, size_t size,
+int add_patterns_from_buffer(char *buf, size_t size,
 				    const char *base, int baselen,
 				    struct pattern_list *pl)
 {

dir.h

@@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
 int add_patterns_from_blob_to_list(struct object_id *oid,
 				   const char *base, int baselen,
 				   struct pattern_list *pl);
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl);
 void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 void add_pattern(const char *string, const char *base,
 		 int baselen, struct pattern_list *pl, int srcpos);


@@ -12,6 +12,7 @@
 #include "object.h"
 #include "oid-array.h"
 #include "prio-queue.h"
+#include "repository.h"
 #include "revision.h"
 #include "string-list.h"
 #include "strmap.h"
@@ -172,6 +173,23 @@ static int add_tree_entries(struct path_walk_context *ctx,
 		if (type == OBJ_TREE)
 			strbuf_addch(&path, '/');

+		if (ctx->info->pl) {
+			int dtype;
+			enum pattern_match_result match;
+			match = path_matches_pattern_list(path.buf, path.len,
+							  path.buf + base_len, &dtype,
+							  ctx->info->pl,
+							  ctx->repo->index);
+
+			if (ctx->info->pl->use_cone_patterns &&
+			    match == NOT_MATCHED)
+				continue;
+			else if (!ctx->info->pl->use_cone_patterns &&
+				 type == OBJ_BLOB &&
+				 match != MATCHED)
+				continue;
+		}
+
 		if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
 			CALLOC_ARRAY(list, 1);
 			list->type = type;
@@ -582,10 +600,10 @@ void path_walk_info_init(struct path_walk_info *info)
 	memcpy(info, &empty, sizeof(empty));
 }

-void path_walk_info_clear(struct path_walk_info *info UNUSED)
+void path_walk_info_clear(struct path_walk_info *info)
 {
-	/*
-	 * This destructor is empty for now, as info->revs
-	 * is not owned by 'struct path_walk_info'.
-	 */
+	if (info->pl) {
+		clear_pattern_list(info->pl);
+		free(info->pl);
+	}
 }


@@ -6,6 +6,7 @@

 struct rev_info;
 struct oid_array;
+struct pattern_list;

 /**
  * The type of a function pointer for the method that is called on a list of
@@ -48,6 +49,16 @@ struct path_walk_info {
 	 * walk the children of such trees.
 	 */
 	int prune_all_uninteresting;
+
+	/**
+	 * Specify a sparse-checkout definition to match our paths to. Do not
+	 * walk outside of this sparse definition. If the patterns are in
+	 * cone mode, then the search may prune directories that are outside
+	 * of the cone. If not in cone mode, then all tree paths will be
+	 * explored but the path_fn will only be called when the path matches
+	 * the sparse-checkout patterns.
+	 */
+	struct pattern_list *pl;
 };

 #define PATH_WALK_INFO_INIT { \


@@ -1,6 +1,7 @@
 #define USE_THE_REPOSITORY_VARIABLE

 #include "test-tool.h"
+#include "dir.h"
 #include "environment.h"
 #include "hex.h"
 #include "object-name.h"
@@ -9,6 +10,7 @@
 #include "revision.h"
 #include "setup.h"
 #include "parse-options.h"
+#include "strbuf.h"
 #include "path-walk.h"
 #include "oid-array.h"

@@ -65,7 +67,7 @@ static int emit_block(const char *path, struct oid_array *oids,

 int cmd__path_walk(int argc, const char **argv)
 {
-	int res;
+	int res, stdin_pl = 0;
 	struct rev_info revs = REV_INFO_INIT;
 	struct path_walk_info info = PATH_WALK_INFO_INIT;
 	struct path_walk_test_data data = { 0 };
@@ -80,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
 			 N_("toggle inclusion of tree objects")),
 		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
 			 N_("toggle pruning of uninteresting paths")),
+		OPT_BOOL(0, "stdin-pl", &stdin_pl,
+			 N_("read a pattern list over stdin")),
 		OPT_END(),
 	};

@@ -99,6 +103,17 @@ int cmd__path_walk(int argc, const char **argv)
 	info.path_fn = emit_block;
 	info.path_fn_data = &data;

+	if (stdin_pl) {
+		struct strbuf in = STRBUF_INIT;
+		CALLOC_ARRAY(info.pl, 1);
+
+		info.pl->use_cone_patterns = 1;
+
+		strbuf_fread(&in, 2048, stdin);
+		add_patterns_from_buffer(in.buf, in.len, "", 0, info.pl);
+		strbuf_release(&in);
+	}
+
 	res = walk_objects_by_path(&info);

 	printf("commits:%" PRIuMAX "\n"
@@ -107,6 +122,11 @@ int cmd__path_walk(int argc, const char **argv)
 	       "tags:%" PRIuMAX "\n",
 	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);

+	if (info.pl) {
+		clear_pattern_list(info.pl);
+		free(info.pl);
+	}
+
 	release_revisions(&revs);
 	return res;
 }


@@ -77,6 +77,94 @@ test_expect_success 'do partial clone 2, backfill min batch size' '
 	test_line_count = 0 revs2
 '

+test_expect_success 'backfill --sparse' '
+	git clone --sparse --filter=blob:none \
+		--single-branch --branch=main \
+		"file://$(pwd)/srv.bare" backfill3 &&
+
+	# Initial checkout includes four files at root.
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 44 missing &&
+
+	# Initial sparse-checkout is just the files at root, so we get the
+	# older versions of the four files at tip.
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace1" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 4 <sparse-trace1 &&
+	test_trace2_data path-walk paths 5 <sparse-trace1 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 40 missing &&
+
+	# Expand the sparse-checkout to include 'd' recursively. This
+	# engages the algorithm to skip the trees for 'a'. Note that
+	# the "sparse-checkout set" command downloads the objects at tip
+	# to satisfy the current checkout.
+	git -C backfill3 sparse-checkout set d &&
+	GIT_TRACE2_EVENT="$(pwd)/sparse-trace2" git \
+		-C backfill3 backfill --sparse &&
+	test_trace2_data promisor fetch_count 8 <sparse-trace2 &&
+	test_trace2_data path-walk paths 15 <sparse-trace2 &&
+	git -C backfill3 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 24 missing
+'
+
+test_expect_success 'backfill --sparse without cone mode (positive)' '
+	git clone --no-checkout --filter=blob:none \
+		--single-branch --branch=main \
+		"file://$(pwd)/srv.bare" backfill4 &&
+
+	# No blobs yet
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 48 missing &&
+
+	# Define sparse-checkout by filename regardless of parent directory.
+	# This downloads 6 blobs to satisfy the checkout.
+	git -C backfill4 sparse-checkout set --no-cone "**/file.1.txt" &&
+	git -C backfill4 checkout main &&
+
+	# Track new blob count
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 42 missing &&
+
+	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace1" git \
+		-C backfill4 backfill --sparse &&
+	test_trace2_data promisor fetch_count 6 <no-cone-trace1 &&
+
+	# This walk needed to visit all directories to search for these paths.
+	test_trace2_data path-walk paths 12 <no-cone-trace1 &&
+	git -C backfill4 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 36 missing
+'
+
+test_expect_success 'backfill --sparse without cone mode (negative)' '
+	git clone --no-checkout --filter=blob:none \
+		--single-branch --branch=main \
+		"file://$(pwd)/srv.bare" backfill5 &&
+
+	# No blobs yet
+	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 48 missing &&
+
+	# Define sparse-checkout by filename regardless of parent directory.
+	# This downloads 18 blobs to satisfy the checkout
+	git -C backfill5 sparse-checkout set --no-cone "**/file*" "!**/file.1.txt" &&
+	git -C backfill5 checkout main &&
+
+	# Track new blob count
+	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 30 missing &&
+
+	GIT_TRACE2_EVENT="$(pwd)/no-cone-trace2" git \
+		-C backfill5 backfill --sparse &&
+	test_trace2_data promisor fetch_count 18 <no-cone-trace2 &&
+
+	# This walk needed to visit all directories to search for these paths, plus
+	# 12 extra "file.?.txt" paths than the previous test.
+	test_trace2_data path-walk paths 24 <no-cone-trace2 &&
+	git -C backfill5 rev-list --quiet --objects --missing=print HEAD >missing &&
+	test_line_count = 12 missing
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd


@@ -176,6 +176,38 @@ test_expect_success 'branches and indexed objects mix well' '
 	test_cmp_sorted expect out
 '

+test_expect_success 'base & topic, sparse' '
+	cat >patterns <<-EOF &&
+	/*
+	!/*/
+	/left/
+	EOF
+
+	test-tool path-walk --stdin-pl -- base topic <patterns >out &&
+
+	cat >expect <<-EOF &&
+	0:commit::$(git rev-parse topic)
+	0:commit::$(git rev-parse base)
+	0:commit::$(git rev-parse base~1)
+	0:commit::$(git rev-parse base~2)
+	1:tree::$(git rev-parse topic^{tree})
+	1:tree::$(git rev-parse base^{tree})
+	1:tree::$(git rev-parse base~1^{tree})
+	1:tree::$(git rev-parse base~2^{tree})
+	2:blob:a:$(git rev-parse base~2:a)
+	3:tree:left/:$(git rev-parse base:left)
+	3:tree:left/:$(git rev-parse base~2:left)
+	4:blob:left/b:$(git rev-parse base~2:left/b)
+	4:blob:left/b:$(git rev-parse base:left/b)
+	blobs:3
+	commits:4
+	tags:0
+	trees:6
+	EOF
+
+	test_cmp_sorted expect out
+'
+
 test_expect_success 'topic only' '
 	test-tool path-walk -- topic >out &&