backfill: basic functionality and tests

The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing
the revision walk at HEAD to start the path-walk from all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that they are all present already.

The path-walk API provides lists of objects in batches according to a
common path, but that list could be very small. We want to balance the
number of requests to the server with the ability to have the process
interrupted with minimal repeated work to catch up in the next run.
Based on some experiments (detailed in the next change) a minimum batch
size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob
IDs, they are collected into a list of objects for a request to the
server. When that list is at least the minimum batch size, then the
request is sent to the server for the new objects. However, the list of
blob IDs from the path-walk API could be much longer than the batch
size. At this moment, it is unclear if there is a benefit to split the
list when there are too many objects at the same path.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit is contained in:
Derrick Stolee
2025-02-03 17:11:04 +00:00
committed by Junio C Hamano
parent a3f79e9abd
commit 1e72e889e7
5 changed files with 227 additions and 4 deletions

View File

@ -14,6 +14,37 @@ git backfill [<options>]
DESCRIPTION
-----------
Blobless partial clones are created using `git clone --filter=blob:none`
and then configure the local repository such that the Git client avoids
downloading blob objects unless they are required for a local operation.
This initially means that the clone and later fetches download reachable
commits and trees but no blobs. Later operations that change the `HEAD`
pointer, such as `git checkout` or `git merge`, may need to download
missing blobs in order to complete their operation.
In the worst cases, commands that compute blob diffs, such as `git blame`,
become very slow as they download the missing blobs in single-blob
requests to satisfy the missing object as the Git command needs it. This
leads to multiple download requests and no ability for the Git server to
provide delta compression across those objects.
The `git backfill` command provides a way for the user to request that
Git downloads the missing blobs (with optional filters) such that the
missing blobs representing historical versions of files can be downloaded
in batches. The `backfill` command attempts to optimize the request by
grouping blobs that appear at the same path, hopefully leading to good
delta compression in the packfile sent by the server.
In this way, `git backfill` provides a mechanism to break a large clone
into smaller chunks. Starting with a blobless partial clone with `git
clone --filter=blob:none` and then running `git backfill` in the local
repository provides a way to download all reachable objects in several
smaller network calls than downloading the entire repository at clone
time.
By default, `git backfill` downloads all blobs reachable from the `HEAD`
commit. This set can be restricted or expanded using various options.
THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
SEE ALSO

View File

@ -60,4 +60,5 @@ Examples
--------
See example usages in:
`t/helper/test-path-walk.c`
`t/helper/test-path-walk.c`,
`builtin/backfill.c`