The previous change introduced a new pack_name_hash_v2() function that
intends to satisfy much of the hash locality features of the existing
pack_name_hash() function while also distinguishing paths with similar
final components of their paths.
This change adds a new --name-hash-version option for 'git pack-objects'
to allow users to select their preferred function version. This use of
an integer version allows for future expansion and a direct way to later
store a name hash version in the .bitmap format.
For now, let's consider how effective this mechanism is when repacking a
repository with different name hash versions. Specifically, we will
execute 'git pack-objects' the same way a 'git repack -adf' process
would, except we include --name-hash-version=<n> for testing.
On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:
| Stage | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone | 260 MB | N/A |
| --name-hash-version=1 | 127 MB | 129s |
| --name-hash-version=2 | 127 MB | 112s |
This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics for both versions.
Let's consider some repositories that are hitting too many collisions
with version 1. First, let's explore the kinds of paths that are
commonly causing these collisions:
* "/CHANGELOG.json" is 15 characters, and is created by the beachball
[1] tool. Only the final character of the parent directory can
differentiate different versions of this file, but also only the two
most-significant digits. If that character is a letter, then this is
always a collision. Similar issues occur with the similar
"/CHANGELOG.md" path, though there is more opportunity for
differences In the parent directory.
* Localization files frequently have common filenames but
differentiates via parent directories. In C#, the name
"/strings.resx.lcl" is used for these localization files and they
will all collide in name-hash.
[1] https://github.com/microsoft/beachball
I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.
One open-source example is the fluentui [2] repo, which uses beachball
to generate CHANGELOG.json and CHANGELOG.md files, and these files have
very poor delta characteristics when comparing against versions across
parent directories.
| Stage | Pack Size | Repack Time |
|-----------------------|-----------|-------------|
| After clone | 694 MB | N/A |
| --name-hash-version=1 | 438 MB | 728s |
| --name-hash-version=2 | 168 MB | 142s |
[2] https://github.com/microsoft/fluentui
In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.
Using a collection of repositories that use the beachball tool, I was
able to make similar comparisions with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:
| Repo | --name-hash-version=1 | --name-hash-version=2 |
|----------|-----------------------|-----------------------|
| fluentui | 440 MB | 161 MB |
| Repo B | 6,248 MB | 856 MB |
| Repo C | 37,278 MB | 6,755 MB |
| Repo D | 131,204 MB | 7,463 MB |
Future changes could include making --name-hash-version implied by a config
value or even implied by default during a full repack.
It is important to point out that the name hash value is stored in the
.bitmap file format, so we must force --name-hash-version=1 when bitmaps
are being read or written. Later, the bitmap format could be updated to
be aware of the name hash version so deltas can be quickly computed
across the bitmapped/not-bitmapped boundary. To promote the safety of
this parameter, the validate_name_hash_version() method will die() if
the given name-hash version is incorrect and will disable newer versions
if not yet compatible with other features, such as --write-bitmap-index.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The main function we use to search a pack index for an object is
find_pack_entry_one(). That function still takes a bare pointer to the
hash, despite the fact that its underlying bsearch_pack() function needs
an object_id struct. And so we end up making an extra copy of the hash
into the struct just to do a lookup.
As it turns out, all callers but one already have such an object_id. So
we can just take a pointer to that struct and use it directly. This
avoids the extra copy and provides a more type-safe interface.
The one exception is get_delta_base() in packfile.c, when we are chasing
a REF_DELTA from inside the pack (and thus we have a pointer directly to
the mmap'd pack memory, not a struct). We can just bump the hashcpy()
from inside find_pack_entry_one() to this one caller that needs it.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
The convention to calling into built-in command implementation has
been updated to pass the repository, if known, together with the
prefix value.
* jc/pass-repo-to-builtins:
add: pass in repo variable instead of global the_repository
builtin: remove USE_THE_REPOSITORY for those without the_repository
builtin: remove USE_THE_REPOSITORY_VARIABLE from builtin.h
builtin: add a repository parameter for builtin functions
Code clean-up.
* ps/environ-wo-the-repository: (21 commits)
environment: stop storing "core.notesRef" globally
environment: stop storing "core.warnAmbiguousRefs" globally
environment: stop storing "core.preferSymlinkRefs" globally
environment: stop storing "core.logAllRefUpdates" globally
refs: stop modifying global `log_all_ref_updates` variable
branch: stop modifying `log_all_ref_updates` variable
repo-settings: track defaults close to `struct repo_settings`
repo-settings: split out declarations into a standalone header
environment: guard state depending on a repository
environment: reorder header to split out `the_repository`-free section
environment: move `set_git_dir()` and related into setup layer
environment: make `get_git_namespace()` self-contained
environment: move object database functions into object layer
config: make dependency on repo in `read_early_config()` explicit
config: document `read_early_config()` and `read_very_early_config()`
environment: make `get_git_work_tree()` accept a repository
environment: make `get_graft_file()` accept a repository
environment: make `get_index_file()` accept a repository
environment: make `get_object_directory()` accept a repository
environment: make `get_git_common_dir()` accept a repository
...
Instead of including USE_THE_REPOSITORY_VARIABLE by default on every
builtin, remove it from builtin.h and add it to all the builtins that
include builtin.h (by definition, that means all builtins/*.c).
Also, remove the include statement for repository.h since it gets
brought in through builtin.h.
The next step will be to migrate each builtin
from having to use the_repository.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In order to reduce the usage of the global the_repository, add a
parameter to builtin functions that will get passed a repository
variable.
This commit uses UNUSED on most of the builtin functions, as subsequent
commits will modify the actual builtins to pass the repository parameter
down.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `get_object_directory()` function retrieves the path to the object
directory for `the_repository`. Make it accept a `struct repository`
such that it can work on arbitrary repositories and make it part of the
repository subsystem. This reduces our reliance on `the_repository` and
clarifies scope.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `--keep-pack` option of git-pack-objects(1) populates the arguments
into a string list. And while the list is marked as `NODUP` and thus
won't duplicate the strings, the list entries themselves still need to
be free'd. We don't though, causing a leak.
Plug it.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `write_reused_pack_one()` defines an header to store the
OFS_DELTA header, but uses the constant "10" instead of
"MAX_PACK_OBJECT_HEADER" (as is done elsewhere in the same patch, circa
bb514de356 (pack-objects: improve partial packfile reuse, 2019-12-18)).
Declare the `ofs_header` field to be sized according to
`MAX_PACK_OBJECT_HEADER` (which is 10, as defined in "pack.h") instead
of the constant 10.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When reusing chunks verbatim from an existing source pack, the function
write_reused_pack() first attempts to reuse whole words (via the
function `write_reused_pack_verbatim()`), and then individual bits (via
`write_reused_pack_one()`).
In the non-MIDX case, all of this code works fine. Likewise, in the MIDX
case, processing bits individually from the first (preferred) pack works
fine. However, processing subsequent packs in the MIDX case is broken
when there are duplicate objects among the set of MIDX'd packs.
This is because we treat the individual bit positions as valid pack
positions within the source pack(s), which does not account for gaps in
the source pack, like we see when the MIDX must break ties between
duplicate objects which appear in multiple packs.
The broken code looks like:
for (; i < reuse_packfile_bitmap->word_alloc; i++) {
for (offset = 0; offset < BITS_IN_EWORD, offset++) {
/* ... */
write_reused_pack_one(reuse_packfile->p,
pos + offset - reuse_packfile->bitmap_pos,
f, pack_start, &w_curs);
}
}
, where the second argument is incorrect and does not account for gaps.
Instead, make sure that we translate bit positions in the MIDX's
pseudo-pack order to pack positions in the respective source packs by:
- Translating the bit position (pseudo-pack order) to a MIDX position
(lexical order).
- Use the MIDX position to obtain the offset at which the given object
occurs in the source pack.
- Then translate that offset back into a pack relative position within
the source pack by calling offset_to_pack_pos().
After doing this, then we can safely use the result as a pack position.
Note that when doing single-pack reuse, as well as reusing objects from
the MIDX's preferred pack, such translation is not necessary, since
either ties are broken in favor of the preferred pack, or there are no
ties to break at all (in the case of non-MIDX bitmaps).
Failing to do this can result in strange failure modes. One example that
can occur when misinterpreting bits in the above fashion is that Git
thinks it's supposed to send a delta that the caller does not want.
Under this (incorrect) assumption, we try to look up the delta's base
(so that we can patch any OFS_DELTAs if necessary). We do this using
find_reused_offset().
But if we try and call that function for an offset belonging to an
object we did not send, we'll get back garbage. This can result in us
computing a negative fixup value, which results in memory corruption
when trying to write the (patched) OFS_DELTA header.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We created a useless pseudo-merge reachability bitmap that is about
0 commits, and attempted to include commits that are not in packs,
which made no sense. These bugs have been corrected.
* tb/pseudo-merge-bitmap-fixes:
pseudo-merge.c: ensure pseudo-merge groups are closed
pseudo-merge.c: do not generate empty pseudo-merge commits
t/t5333-pseudo-merge-bitmaps.sh: demonstrate empty pseudo-merge groups
pack-bitmap-write.c: select pseudo-merges even for small bitmaps
pack-bitmap: drop redundant args from `bitmap_writer_finish()`
pack-bitmap: drop redundant args from `bitmap_writer_build()`
pack-bitmap: drop redundant args from `bitmap_writer_build_type_index()`
pack-bitmap: initialize `bitmap_writer_init()` with packing_data
In a similar fashion as the previous commit, drop a redundant argument
from the `bitmap_writer_finish()` function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a similar fashion as the previous commit, drop a redundant argument
from the `bitmap_writer_build()` function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The previous commit ensures that the bitmap_writer's "to_pack" field is
initialized early on, so the "to_pack" and "index_nr" arguments to
`bitmap_writer_build_type_index()` are redundant.
Drop them and adjust the callers accordingly.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In order to determine its object order, the pack-bitmap machinery keeps
a 'struct packing_data' corresponding to the pack or pseudo-pack (when
writing a MIDX bitmap) being written.
The to_pack field is provided to the bitmap machinery by callers of
bitmap_writer_build() and assigned to the bitmap_writer struct at that
point.
But a subsequent commit will want to have access to that data earlier on
during commit selection. Prepare for that by adding a 'to_pack' argument
to 'bitmap_writer_init()', and initializing the field during that
function.
Subsequent commits will clean up other functions which take
now-redundant arguments (like nr_objects, which is equivalent to
pdata->objects_nr, or pdata itself).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a parameter to each_ref_fn so that callers to the ref APIs
that use this function as a callback can have acess to the
unresolved value of a symbolic ref.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help
transition the codebase to rely less on the availability of the
singleton the_repository instance.
* ps/use-the-repository:
hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`
t/helper: remove dependency on `the_repository` in "proc-receive"
t/helper: fix segfault in "oid-array" command without repository
t/helper: use correct object hash in partial-clone helper
compat/fsmonitor: fix socket path in networked SHA256 repos
replace-object: use hash algorithm from passed-in repository
protocol-caps: use hash algorithm from passed-in repository
oidset: pass hash algorithm when parsing file
http-fetch: don't crash when parsing packfile without a repo
hash-ll: merge with "hash.h"
refs: avoid include cycle with "repository.h"
global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
hash: require hash algorithm in `empty_tree_oid_hex()`
hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
hash: make `is_null_oid()` independent of `the_repository`
hash: convert `oidcmp()` and `oideq()` to compare whole hash
global: ensure that object IDs are always padded
hash: require hash algorithm in `oidread()` and `oidclr()`
hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
The pseudo-merge reachability bitmap to help more efficient storage
of the reachability bitmap in a repository with too many refs has
been added.
* tb/pseudo-merge-reachability-bitmap: (26 commits)
pack-bitmap.c: ensure pseudo-merge offset reads are bounded
Documentation/technical/bitmap-format.txt: add missing position table
t/perf: implement performance tests for pseudo-merge bitmaps
pseudo-merge: implement support for finding existing merges
ewah: `bitmap_equals_ewah()`
pack-bitmap: extra trace2 information
pack-bitmap.c: use pseudo-merges during traversal
t/test-lib-functions.sh: support `--notick` in `test_commit_bulk()`
pack-bitmap: implement test helpers for pseudo-merge
ewah: implement `ewah_bitmap_popcount()`
pseudo-merge: implement support for reading pseudo-merge commits
pack-bitmap.c: read pseudo-merge extension
pseudo-merge: scaffolding for reads
pack-bitmap: extract `read_bitmap()` function
pack-bitmap-write.c: write pseudo-merge table
pseudo-merge: implement support for selecting pseudo-merge commits
config: introduce `git_config_double()`
pack-bitmap: make `bitmap_writer_push_bitmapped_commit()` public
pack-bitmap: implement `bitmap_writer_has_bitmapped_object_id()`
pack-bitmap-write: support storing pseudo-merge commits
...
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Further clean-up the refs subsystem to stop relying on
the_repository, and instead use the repository associated to the
ref_store object.
* ps/refs-without-the-repository-updates:
refs/packed: remove references to `the_hash_algo`
refs/files: remove references to `the_hash_algo`
refs/files: use correct repository
refs: remove `dwim_log()`
refs: drop `git_default_branch_name()`
refs: pass repo when peeling objects
refs: move object peeling into "object.c"
refs: pass ref store when detecting dangling symrefs
refs: convert iteration over replace refs to accept ref store
refs: retrieve worktree ref stores via associated repository
refs: refactor `resolve_gitlink_ref()` to accept a repository
refs: pass repo when retrieving submodule ref store
refs: track ref stores via strmap
refs: implement releasing ref storages
refs: rename `init_db` callback to avoid confusion
refs: adjust names for `init` and `init_db` callbacks
The pack bitmap code saw some clean-up to prepare for a follow-up topic.
* tb/pack-bitmap-write-cleanups:
pack-bitmap: introduce `bitmap_writer_free()`
pack-bitmap-write.c: avoid uninitialized 'write_as' field
pack-bitmap: drop unused `max_bitmaps` parameter
pack-bitmap: avoid use of static `bitmap_writer`
pack-bitmap-write.c: move commit_positions into commit_pos fields
object.h: add flags allocated by pack-bitmap.h
The pack-bitmap-writer machinery uses a oidmap (backed by khash.h) to
map from commits selected for bitmaps (by OID) to a bitmapped_commit
structure (containing the bitmap itself, among other things like its XOR
offset, etc.)
This map was initialized at the end of `bitmap_writer_build()`. New
entries are added in `pack-bitmap-write.c::store_selected()`, which is
called by the bitmap_builder machinery (which is responsible for
traversing history and generating the actual bitmaps).
Reorganize when this field is initialized and when entries are added to
it so that we can quickly determine whether a commit is a candidate for
pseudo-merge selection, or not (since it was already selected to receive
a bitmap, and thus storing it in a pseudo-merge would be redundant).
The changes are as follows:
- Introduce a new `bitmap_writer_init()` function which initializes
the `writer.bitmaps` field (instead of waiting until the end of
`bitmap_writer_build()`).
- Add map entries in `push_bitmapped_commit()` (which is called via
`bitmap_writer_select_commits()`) with OID keys and NULL values to
track whether or not we *expect* to write a bitmap for some given
commit.
- Validate that a NULL entry is found matching the given key when we
store a selected bitmap.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both `peel_object()` and `peel_iterated_oid()` implicitly rely on
`the_repository` to look up objects. Despite the fact that we want to
get rid of `the_repository`, it also leads to some restrictions in our
ref iterators when trying to retrieve the peeled value for a repository
other than `the_repository`.
Refactor these functions such that both take a repository as argument
and remove the now-unnecessary restrictions.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that there is clearer memory ownership around the bitmap_writer
structure, introduce a bitmap_writer_free() function that callers may
use to free any memory associated with their instance of the
bitmap_writer structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `max_bitmaps` parameter in `bitmap_writer_select_commits()` was
introduced back in 7cc8f97108 (pack-objects: implement bitmap writing,
2013-12-21), making it original to the bitmap implementation in Git
itself.
When that patch was merged via 0f9e62e084 (Merge branch
'jk/pack-bitmap', 2014-02-27), its sole caller in builtin/pack-objects.c
passed a value of "-1" for `max_bitmaps`, indicating no limit.
Since then, the only other caller (in midx.c, added via c528e17966
(pack-bitmap: write multi-pack bitmaps, 2021-08-31)) also uses a value
of "-1" for `max_bitmaps`.
Since no callers have needed a finite limit for the `max_bitmaps`
parameter in the nearly decade that has passed since 0f9e62e084, let's
remove the parameter and any dead pieces of code connected to it.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The pack-bitmap machinery uses a structure called 'bitmap_writer' to
collect the data necessary to write out .bitmap files. Since its
introduction in 7cc8f97108 (pack-objects: implement bitmap writing,
2013-12-21), there has been a single static bitmap_writer structure,
which is responsible for all bitmap writing-related operations.
In practice, this is OK, since we are only ever writing a single .bitmap
file in a single process (e.g., `git multi-pack-index write --bitmap`,
`git pack-objects --write-bitmap-index`, `git repack -b`, etc.).
However, having a single static variable makes issues like data
ownership unclear, when to free variables, what has/hasn't been
initialized unclear.
Refactor this code to be written in terms of a given bitmap_writer
structure instead of relying on a static global.
Note that this exposes the structure definition of the bitmap_writer at
the pack-bitmap.h level. We could work around this by, e.g., forcing
callers to declare their writers as:
struct bitmap_writer *writer;
bitmap_writer_init(&bitmap_writer);
and then declaring `bitmap_writer_init()` as taking in a double-pointer
like so:
void bitmap_writer_init(struct bitmap_writer **writer);
which would avoid us having to expose the definition of the structure
itself. This patch takes a different approach, since future patches
(like for the ongoing pseudo-merge bitmaps work) will want to modify the
innards of this structure (in the previous example, via pseudo-merge.c).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Apply the rules that rewrite callers of "refs" interfaces to explicitly
pass `struct ref_store`. The resulting patch has been applied with the
`--whitespace=fix` option.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Work to support a repository that work with both SHA-1 and SHA-256
hash algorithms has started.
* eb/hash-transition: (30 commits)
t1016-compatObjectFormat: add tests to verify the conversion between objects
t1006: test oid compatibility with cat-file
t1006: rename sha1 to oid
test-lib: compute the compatibility hash so tests may use it
builtin/ls-tree: let the oid determine the output algorithm
object-file: handle compat objects in check_object_signature
tree-walk: init_tree_desc take an oid to get the hash algorithm
builtin/cat-file: let the oid determine the output algorithm
rev-parse: add an --output-object-format parameter
repository: implement extensions.compatObjectFormat
object-file: update object_info_extended to reencode objects
object-file-convert: convert commits that embed signed tags
object-file-convert: convert commit objects when writing
object-file-convert: don't leak when converting tag objects
object-file-convert: convert tag objects when writing
object-file-convert: add a function to convert trees between algorithms
object: factor out parse_mode out of fast-import and tree-walk into in object.h
cache: add a function to read an OID of a specific algorithm
tag: sign both hashes
commit: export add_header_signature to support handling signatures on tags
...
Now that multi-pack reuse is supported, enable it via the
feature.experimental configuration in addition to the classic
`pack.allowPackReuse`.
This will allow more users to experiment with the new behavior who might
not otherwise be aware of the existing `pack.allowPackReuse`
configuration option.
The enum with values NO_PACK_REUSE, SINGLE_PACK_REUSE, and
MULTI_PACK_REUSE is defined statically in builtin/pack-objects.c's
compilation unit. We could hoist that enum into a scope visible from the
repository_settings struct, and then use that enum value in
pack-objects. Instead, define a single int that indicates what
pack-objects's default value should be to avoid additional unnecessary
code movement.
Though `feature.experimental` implies `pack.allowPackReuse=multi`, this
can still be overridden by explicitly setting the latter configuration
to either "single" or "false". Tests covering all of these cases are
showin t5332.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Streaming spans of packfile data used to be done only from a
single, primary, pack in a repository with multiple packfiles. It
has been extended to allow reuse from other packfiles, too.
* tb/multi-pack-verbatim-reuse: (26 commits)
t/perf: add performance tests for multi-pack reuse
pack-bitmap: enable reuse from all bitmapped packs
pack-objects: allow setting `pack.allowPackReuse` to "single"
t/test-lib-functions.sh: implement `test_trace2_data` helper
pack-objects: add tracing for various packfile metrics
pack-bitmap: prepare to mark objects from multiple packs for reuse
pack-revindex: implement `midx_pair_to_pack_pos()`
pack-revindex: factor out `midx_key_to_pack_pos()` helper
midx: implement `midx_preferred_pack()`
git-compat-util.h: implement checked size_t to uint32_t conversion
pack-objects: include number of packs reused in output
pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse
pack-objects: prepare `write_reused_pack()` for multi-pack reuse
pack-objects: pass `bitmapped_pack`'s to pack-reuse functions
pack-objects: keep track of `pack_start` for each reuse pack
pack-objects: parameterize pack-reuse routines over a single pack
pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()`
pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature
ewah: implement `bitmap_is_empty()`
pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions
...
Remove unused header "#include".
* en/header-cleanup:
treewide: remove unnecessary includes in source files
treewide: add direct includes currently only pulled in transitively
trace2/tr2_tls.h: remove unnecessary include
submodule-config.h: remove unnecessary include
pkt-line.h: remove unnecessary include
line-log.h: remove unnecessary include
http.h: remove unnecessary include
fsmonitor--daemon.h: remove unnecessary includes
blame.h: remove unnecessary includes
archive.h: remove unnecessary include
treewide: remove unnecessary includes in source files
treewide: remove unnecessary includes from header files
Each of these were checked with
gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).
...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file. These cases were:
* builtin/credential-cache.c
* builtin/pull.c
* builtin/send-pack.c
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that both the pack-bitmap and pack-objects code are prepared to
handle marking and using objects from multiple bitmapped packs for
verbatim reuse, allow marking objects from all bitmapped packs as
eligible for reuse.
Within the `reuse_partial_packfile_from_bitmap()` function, we no longer
only mark the pack whose first object is at bit position zero for reuse,
and instead mark any pack contained in the MIDX as a reuse candidate.
Provide a handful of test cases in a new script (t5332) exercising
interesting behavior for multi-pack reuse to ensure that we performed
all of the previous steps correctly.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In e704fc7978 (pack-objects: introduce pack.allowPackReuse, 2019-12-18),
the `pack.allowPackReuse` configuration option was introduced, allowing
users to disable the pack reuse mechanism.
To prepare for debugging multi-pack reuse, allow setting configuration
to "single" in addition to the usual bool-or-int values.
"single" implies the same behavior as "true", "1", "yes", and so on. But
it will complement a new "multi" value (to be introduced in a future
commit). When set to "single", we will only perform pack reuse on a
single pack, regardless of whether or not there are multiple MIDX'd
packs.
This requires no code changes (yet), since we only support single pack
reuse.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As part of the multi-pack reuse effort, we will want to add some tests
that assert that we reused a certain number of objects from a certain
number of packs.
We could do this by grepping through the stderr output of
`pack-objects`, but doing so would be brittle in case the output format
changed.
Instead, let's use the trace2 mechanism to log various pieces of
information about the generated packfile, which we can then use to
compare against desired values.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In addition to including the number of objects reused verbatim from a
reuse-pack, include the number of packs from which objects were reused.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `write_reused_pack_verbatim()` within
`builtin/pack-objects.c` is responsible for writing out a continuous
set of objects beginning at the start of the reuse packfile.
In the existing implementation, we did something like:
while (pos < reuse_packfile_bitmap->word_alloc &&
reuse_packfile_bitmap->words[pos] == (eword_t)~0)
pos++;
if (pos)
/* write first `pos * BITS_IN_WORD` objects from pack */
as an optimization to record a single chunk for the longest continuous
prefix of objects wanted out of the reuse pack, instead of having a
chunk for each individual object. For more details, see bb514de356
(pack-objects: improve partial packfile reuse, 2019-12-18).
In order to retain this optimization in a multi-pack reuse world, we can
no longer assume that the first object in a pack is on a word boundary
in the bitmap storing the set of reusable objects.
Assuming that all objects from the beginning of the reuse packfile up to
the object corresponding to the first bit on a word boundary are part of
the result, consume whole words at a time until the last whole word
belonging to the reuse packfile. Copy those objects to the resulting
packfile, and track that we reused them by recording a single chunk.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `write_reused_pack()` within `builtin/pack-objects.c` is
responsible for performing pack-reuse on a single pack, and has two main
functions:
- it dispatches a call to `write_reused_pack_verbatim()` to see if we
can reuse portions of the packfile in whole-word chunks
- for any remaining objects (that is, any objects that appear after
the first "gap" in the bitmap), call write_reused_pack_one() on that
object to record it for reuse.
Prepare this function for multi-pack reuse by removing the assumption
that the bit position corresponding to the first object being reused
from a given pack must be at bit position zero.
The changes in this function are mostly straightforward. Initialize `i`
to the position of the first word to contain bits corresponding to that
reuse pack. In most situations, we throw the initialized value away,
since we end up replacing it with the return value from
write_reused_pack_verbatim(), moving us past the section of whole words
that we reused.
Likewise, modify the per-object loop to ignore any bits at the beginning
of the first word that do not belong to the pack currently being reused,
as well as skip to the "done" section once we have processed the last
bit corresponding to this pack.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Further prepare pack-objects to perform verbatim pack-reuse over
multiple packfiles by converting functions that take in a pointer to a
`struct packed_git` to instead take in a pointer to a `struct
bitmapped_pack`.
The additional information found in the bitmapped_pack struct (such as
the bit position corresponding to the beginning of the pack) will be
necessary in order to perform verbatim pack-reuse.
Note that we don't use any of the extra pieces of information contained
in the bitmapped_pack struct, so this step is merely preparatory and
does not introduce any functional changes.
Note further that we do not change the argument type to
write_reused_pack_one(). That function is responsible for copying
sections of the packfile directly and optionally patching any OFS_DELTAs
to account for not reusing sections of the packfile in between a delta
and its base.
As such, that function is (and should remain) oblivious to multi-pack
reuse, and does not require any of the extra pieces of information
stored in the bitmapped_pack struct.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When reusing objects from a pack, we keep track of a set of one or more
`reused_chunk`s, corresponding to sections of one or more object(s) from
a source pack that we are reusing. Each chunk contains two pieces of
information:
- the offset of the first object in the source pack (relative to the
beginning of the source pack)
- the difference between that offset, and the corresponding offset in
the pack we're generating
The purpose of keeping track of these is so that we can patch an
OFS_DELTAs that cross over a section of the reuse pack that we didn't
take.
For instance, consider a hypothetical pack as shown below:
(chunk #2)
__________...
/
/
+--------+---------+-------------------+---------+
... | <base> | <other> | (unused) | <delta> | ...
+--------+---------+-------------------+---------+
\ /
\______________/
(chunk #1)
Suppose that we are sending objects "base", "other", and "delta", and
that the "delta" object is stored as an OFS_DELTA, and that its base is
"base". If we don't send any objects in the "(unused)" range, we can't
copy the delta'd object directly, since its delta offset includes a
range of the pack that we didn't copy, so we have to account for that
difference when patching and reassembling the delta.
In order to compute this value correctly, we need to know not only where
we are in the packfile we're assembling (with `hashfile_total(f)`) but
also the position of the first byte of the packfile that we are
currently reusing. Currently, this works just fine, since when reusing
only a single pack those two values are always identical (because
verbatim reuse is the first thing pack-objects does when enabled after
writing the pack header).
But when reusing multiple packs which have one or more gaps, we'll need
to account for these two values diverging.
Together, these two allow us to compute the reused chunk's offset
difference relative to the start of the reused pack, as desired.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The routines pack-objects uses to perform verbatim pack-reuse are:
- write_reused_pack_one()
- write_reused_pack_verbatim()
- write_reused_pack()
, all of which assume that there is exactly one packfile being reused:
the global constant `reuse_packfile`.
Prepare for reusing objects from multiple packs by making reuse packfile
a parameter of each of the above functions in preparation for calling
these functions in a loop with multiple packfiles.
Note that we still have the global "reuse_packfile", but pass it through
each of the above function's parameter lists, eliminating all but one
direct access (the top-level caller in `write_pack_file()`). Even after
this series, we will still have a global, but it will hold the array of
reusable packfiles, and we'll pass them one at a time to these functions
in a loop.
Note also that we will eventually need to pass a `bitmapped_pack`
instead of a `packed_git` in order to hold onto additional information
required for reuse (such as the bit position of the first object
belonging to that pack). But that change will be made in a future commit
so as to minimize the noise below as much as possible.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Further prepare for enabling verbatim pack-reuse over multiple packfiles
by changing the signature of reuse_partial_packfile_from_bitmap() to
populate an array of `struct bitmapped_pack *`'s instead of a pointer to
a single packfile.
Since the array we're filling out is sized dynamically[^1], add an
additional `size_t *` parameter which will hold the number of reusable
packs (equal to the number of elements in the array).
Note that since we still have not implemented true multi-pack reuse,
these changes aren't propagated out to the rest of the caller in
builtin/pack-objects.c.
In the interim state, we expect that the array has a single element, and
we use that element to fill out the static `reuse_packfile` variable
(which is a bog-standard `struct packed_git *`). Future commits will
continue to push this change further out through the pack-objects code.
[^1]: That is, even though we know the number of packs which are
candidates for pack-reuse, we do not know how many of those
candidates we can actually reuse.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The signature of `reuse_partial_packfile_from_bitmap()` currently takes
in a bitmap, as well as three output parameters (filled through
pointers, and passed as arguments), and also returns an integer result.
The output parameters are filled out with: (a) the packfile used for
pack-reuse, (b) the number of objects from that pack that we can reuse,
and (c) a bitmap indicating which objects we can reuse. The return value
is either -1 (when there are no objects to reuse), or 0 (when there is
at least one object to reuse).
Some of these parameters are redundant. Notably, we can infer from the
bitmap how many objects are reused by calling bitmap_popcount(). And we
can similar compute the return value based on that number as well.
As such, clean up the signature of this function to drop the "*entries"
parameter, as well as the int return value, since the single caller of
this function can infer these values themself.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The pack-objects internals use a packing_data struct to track what
objects are part of the pack(s) being formed.
Since these structures contain allocated fields, failing to
appropriately free() them results in a leak. Plug that leak by
introducing a clear_packing_data() function, and call it in the
appropriate spots.
This is a fairly straightforward leak to plug, since none of the callers
expect to read any values or have any references to parts of the address
space being freed.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When the config parser sees an "implicit" bool like:
[core]
someVariable
it passes NULL to the config callback. Any callback code which expects a
string must check for NULL. This usually happens via helpers like
git_config_string(), etc, but some custom code forgets to do so and will
segfault.
These are all fairly vanilla cases where the solution is just the usual
pattern of:
if (!value)
return config_error_nonbool(var);
though note that in a few cases we have to split initializers like:
int some_var = initializer();
into:
int some_var;
if (!value)
return config_error_nonbool(var);
some_var = initializer();
There are still some broken instances after this patch, which I'll
address on their own in individual patches after this one.
Reported-by: Carlos Andrés Ramírez Cataño <antaigroupltda@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
To make it possible for git ls-tree to display the tree encoded
in the hash algorithm of the oid specified to git ls-tree, update
init_tree_desc to take as a parameter the oid of the tree object.
Update all callers of init_tree_desc and init_tree_desc_gently
to pass the oid of the tree object.
Use the oid of the tree object to discover the hash algorithm
of the oid and store that hash algorithm in struct tree_desc.
Use the hash algorithm in decode_tree_entry and
update_tree_entry_internal to handle reading a tree object encoded in
a hash algorithm that differs from the repositories hash algorithm.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
9535ce7337 (pack-objects: add list-objects filtering, 2017-11-21)
taught `git pack-objects` to use `--filter`, but required the use of
`--stdout` since a partial clone mechanism was not yet in place to
handle missing objects. Since then, changes like 9e27beaa23
(promisor-remote: implement promisor_remote_get_direct(), 2019-06-25)
and others added support to dynamically fetch objects that were missing.
Even without a promisor remote, filtering out objects can also be useful
if we can put the filtered out objects in a separate pack, and in this
case it also makes sense for pack-objects to write the packfile directly
to an actual file rather than on stdout.
Remove the `--stdout` requirement when using `--filter`, so that in a
follow-up commit, repack can pass `--filter` to pack-objects to omit
certain objects from the resulting packfile.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Unused parameters to functions are marked as such, and/or removed,
in order to bring us closer to -Wunused-parameter clean.
* jk/unused-post-2.42-part2:
parse-options: mark unused parameters in noop callback
interpret-trailers: mark unused "unset" parameters in option callbacks
parse-options: add more BUG_ON() annotations
merge: do not pass unused opt->value parameter
parse-options: mark unused "opt" parameter in callbacks
parse-options: prefer opt->value to globals in callbacks
checkout-index: delay automatic setting of to_tempfile
format-patch: use OPT_STRING_LIST for to/cc options
merge: simplify parsing of "-n" option
merge: make xopts a strvec
The previous commit argued that parse-options callbacks should try to
use opt->value rather than touching globals directly. In some cases,
however, that's awkward to do. Some callbacks touch multiple variables,
or may even just call into an abstracted function that does so.
In some of these cases we _could_ convert them by stuffing the multiple
variables into a single struct and passing the struct pointer through
opt->value. But that may make other parts of the code less readable,
as the struct relationship has to be mentioned everywhere.
Let's just accept that these cases are special and leave them as-is. But
we do need to mark their "opt" parameters to satisfy -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>