
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensive as computing blob diffs will
trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and we
are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees while
also achieving a similar goal. It also downloads in batches based on similar
path names, presenting a resumable download if things are interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.

To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see how
much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the sparse-checkout,
we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from GitHub
is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux repository is
currently over 6 GB, so this is a valuable way to download a fraction of the
repository and no longer need network access for all reachable objects within
the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the user's
network speed or reliability, the repository's file structure, and how many
versions there are of the file within the sparse-checkout scope. There will
not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
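
The cone-mode benefit mentioned in the commit message (skipping paths once they would expand beyond the sparse-checkout) can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not Git's implementation: `struct toy_cone`, `toy_starts_with()` and `toy_dir_may_contain_matches()` are invented names, and the real matching goes through `path_matches_pattern_list()` on a `struct pattern_list`. The model treats a cone as a list of recursive directories and asks whether a tree walk could still find matching paths below a given directory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a cone-mode sparse-checkout definition. */
struct toy_cone {
	const char **dirs; /* recursive directories, each ending in '/' */
	size_t nr;
};

/* Returns nonzero if 'prefix' is a leading part of 'str'. */
int toy_starts_with(const char *str, const char *prefix)
{
	return !strncmp(str, prefix, strlen(prefix));
}

/*
 * Should a walk descend into directory 'dir' (ending in '/')?
 * Yes if 'dir' lies inside a cone directory, or is an ancestor of one.
 * Any other directory can be pruned without reading its tree.
 */
int toy_dir_may_contain_matches(const struct toy_cone *cone, const char *dir)
{
	for (size_t i = 0; i < cone->nr; i++) {
		if (toy_starts_with(dir, cone->dirs[i]) || /* inside the cone */
		    toy_starts_with(cone->dirs[i], dir))   /* ancestor of the cone */
			return 1;
	}
	return 0;
}
```

With a cone containing only `a/b/`, this model descends into `a/` and `a/b/c/` but prunes `a/c/`. Non-cone patterns allow no such prefix test, which is why the header comment says all tree paths are explored in that mode.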
81 lines
2.3 KiB
C
/*
 * path-walk.h : Methods and structures for walking the object graph in batches
 * by the paths that can reach those objects.
 */
#include "object.h" /* Required for 'enum object_type'. */

struct rev_info;
struct oid_array;
struct pattern_list;

/**
 * The type of a function pointer for the method that is called on a list of
 * objects reachable at a given path.
 */
typedef int (*path_fn)(const char *path,
		       struct oid_array *oids,
		       enum object_type type,
		       void *data);

struct path_walk_info {
	/**
	 * revs provides the definitions for the commit walk, including
	 * which commits are UNINTERESTING or not. This structure is
	 * expected to be owned by the caller.
	 */
	struct rev_info *revs;

	/**
	 * The caller wishes to execute custom logic on objects reachable at a
	 * given path. Every reachable object will be visited exactly once, and
	 * the first path to see an object wins. This may not be a stable choice.
	 */
	path_fn path_fn;
	void *path_fn_data;

	/**
	 * Initialize which object types the path_fn should be called on. This
	 * could also limit the walk to skip blobs if not set.
	 */
	int commits;
	int trees;
	int blobs;
	int tags;

	/**
	 * When 'prune_all_uninteresting' is set and a path has all objects
	 * marked as UNINTERESTING, then the path-walk will not visit those
	 * objects. It will not call path_fn on those objects and will not
	 * walk the children of such trees.
	 */
	int prune_all_uninteresting;

	/**
	 * Specify a sparse-checkout definition to match our paths to. Do not
	 * walk outside of this sparse definition. If the patterns are in
	 * cone mode, then the search may prune directories that are outside
	 * of the cone. If not in cone mode, then all tree paths will be
	 * explored but the path_fn will only be called when the path matches
	 * the sparse-checkout patterns.
	 */
	struct pattern_list *pl;
};

#define PATH_WALK_INFO_INIT {   \
	.blobs = 1,		\
	.trees = 1,		\
	.commits = 1,		\
	.tags = 1,		\
}

void path_walk_info_init(struct path_walk_info *info);
void path_walk_info_clear(struct path_walk_info *info);

/**
 * Given the configuration of 'info', walk the commits based on 'info->revs' and
 * call 'info->path_fn' on each discovered path.
 *
 * Returns nonzero on an error.
 */
int walk_objects_by_path(struct path_walk_info *info);
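
To make the `path_fn` contract more concrete, here is a minimal sketch of a callback that groups blobs into batches, loosely in the spirit of the batched downloads described in the commit message. Everything prefixed `toy_` is an assumption made so the example compiles on its own: the stand-in types replace `enum object_type` and `struct oid_array`, and `toy_flush()` only counts where a real consumer would request objects from a promisor remote. It should not be read as the code used by 'git backfill'.

```c
#include <stddef.h>

/* Stand-ins for Git's types so the sketch compiles on its own (assumptions). */
enum toy_object_type {
	TOY_OBJ_COMMIT = 1,
	TOY_OBJ_TREE,
	TOY_OBJ_BLOB,
	TOY_OBJ_TAG
};

struct toy_oid_array {
	size_t nr; /* number of object IDs found at one path */
};

/* Hypothetical per-walk state, handed to the callback through 'data'. */
struct toy_batch_ctx {
	size_t min_batch_size; /* flush once this many blobs are pending */
	size_t pending;        /* blobs collected since the last flush */
	size_t flushes;        /* number of simulated download requests */
};

/* Simulate one batched download; does nothing if the batch is empty. */
void toy_flush(struct toy_batch_ctx *ctx)
{
	if (!ctx->pending)
		return;
	ctx->flushes++;
	ctx->pending = 0;
}

/* Same shape as 'path_fn': called once per path with the objects found there. */
int toy_collect_blobs(const char *path, struct toy_oid_array *oids,
		      enum toy_object_type type, void *data)
{
	struct toy_batch_ctx *ctx = data;

	(void)path; /* unused in this sketch */
	if (type != TOY_OBJ_BLOB)
		return 0; /* only blobs are batched */

	ctx->pending += oids->nr;
	if (ctx->pending >= ctx->min_batch_size)
		toy_flush(ctx);
	return 0; /* 0 means success in this sketch */
}
```

A caller following this pattern would set `path_fn` to the callback, point `path_fn_data` at the context, run the walk, and then call the flush helper once more to send any remainder. Because objects arrive grouped by path, each batch tends to hold versions of the same or neighboring files.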