When write_midx_internal() loads an existing MIDX, all packs are copied
forward into the new MIDX. Improve the readability of
write_midx_internal() by extracting this functionality out into a
separate function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The add_pack_to_midx() callback used via for_each_file_in_pack_dir() is
used to add packs with .idx files to the MIDX being written.
Within this function, we have a pair of checks that discards packs
which:
- appear in an existing MIDX, if we successfully read an existing MIDX
from disk
- or, appear in the "to_include" list, if invoking the MIDX write
machinery with the `--stdin-packs` command-line argument.
A future commit will want to call a slight variant of these checks from
the code that reuses all packs from an existing MIDX, as well as the
current location via add_pack_to_midx(). The latter will be modified in
subsequent commits to only reuse packs which appear in the to_include
list, if one was given.
Prepare for that step by extracting these checks as a subroutine that
may be called from both places.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `compute_sorted_entries()` is broadly responsible for
building an array of the objects to be written into a MIDX based on the
provided list of packs.
If we have loaded an existing MIDX, however, we may not use all of its
packs, despite loading them into the ctx->info array.
The existing implementation simply skips past the first
ctx->m->num_packs (if ctx->m is non-NULL, indicating that we loaded an
existing MIDX). This is because we read objects in packs from an
existing MIDX via the MIDX itself, rather than from the pack-level
fanout to guarantee a de-duplicated result (see: a40498a126 (midx: use
existing midx when writing new one, 2018-07-12)).
Future changes (outside the scope of this patch series) to the MIDX code
will require us to skip *at most* that number[^1].
We could tag each pack with a bit that indicates the pack's contents
should be included in the MIDX. But we can just as easily determine the
number of packs to skip by passing in the number of packs we learned
about after processing an existing MIDX.
[^1]: Kind of. The real number will be bounded by the number of packs in
a MIDX layer, and the number of packs in its base layer(s), but that
concept hasn't been fully defined yet.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `midx-write.c::get_sorted_entries()` is responsible for
constructing the array of OIDs from a given list of packs which will
comprise the MIDX being written.
The singular call-site for this function looks something like:
ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr,
&ctx.entries_nr,
ctx.preferred_pack_idx);
This function has five formal arguments, all of which are members of the
shared `struct write_midx_context` used to track various pieces of
information about the MIDX being written.
The function `get_sorted_entries()` dates back to fe1ed56f5e (midx:
sort and deduplicate objects from packfiles, 2018-07-12), which came
shortly after 396f257018 (multi-pack-index: read packfile list,
2018-07-12). The latter patch introduced the `pack_list` structure,
which was a precursor to the structure we now know as
`write_midx_context` (c.f. 577dc49696 (midx: rename pack_info to
write_midx_context, 2021-02-18)).
At the time, `get_sorted_entries()` likely could have used the pack_list
structure introduced earlier in 396f257018, but understandably did not
since the structure only contained three fields (only two of which were
relevant to `get_sorted_entries()`) at the time.
Simplify the declaration of this function by taking a single pointer to
the whole `struct write_midx_context` instead of various members within
it. Since this function is now computing the entire result (populating
both `ctx->entries`, and `ctx->entries_nr`), rename it to something that
doesn't start with "get_" to make clear that this function has a
side-effect.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When passing a preferred pack to the MIDX write machinery, we ensure
that the given preferred pack is non-empty since 5d3cd09a80 (midx:
reject empty `--preferred-pack`'s, 2021-08-31).
However packs are only loaded (via `write_midx_internal()`, though a
subsequent patch will refactor this code out to its own function) when
the `MIDX_WRITE_REV_INDEX` flag is set.
So if a caller runs:
$ git multi-pack-index write --preferred-pack=...
with both (a) an existing MIDX, and (b) specifies a pack from that MIDX
as the preferred one, without passing `--bitmap`, then the check added
in 5d3cd09a80 will result in a segfault.
Note that packs loaded from disk which don't appear in an existing MIDX
do not trigger this issue, as those packs are loaded unconditionally. We
conditionally load packs from a MIDX since we tolerate MIDXs whose
packs do not resolve (i.e., via the MIDX write after removing
unreferenced packs via 'git multi-pack-index expire').
In practice, this isn't possible to trigger when running `git
multi-pack-index write` from `git repack`, as the latter always passes
`--stdin-packs`, which prevents us from loading an existing MIDX, as it
forces all packs to be read from disk.
But a future commit in this series will change that behavior to
unconditionally load an existing MIDX, even with `--stdin-packs`, making
this behavior trigger-able from 'repack' much more easily.
Prevent this from being an issue by removing the segfault altogether by
calling `prepare_midx_pack()` on packs loaded from an existing MIDX when
either the `MIDX_WRITE_REV_INDEX` flag is set *or* we specified a
`--preferred-pack`.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When constructing a new pack `git multi-pack-index repack` provides a
list of objects which is the union of objects in all MIDX'd packs which
were "included" in the repack.
Though correct, this typically yields a poorly structured pack, since
providing the objects list over stdin does not give pack-objects a
chance to discover the namehash values for each object, leading to
sub-optimal delta selection.
We can use `--stdin-packs` instead, which has a couple of benefits:
- it does a supplemental walk over objects in the supplied list of
packs to discover their namehash, leading to higher-quality delta
selection
- it requires us to list far less data over stdin; instead of listing
each object in the resulting pack, we need only list the
constituent packs from which those objects were selected in the MIDX
Of course, this comes at a slight cost: though we save time on listing
packs versus objects over stdin[^1] (around ~650 milliseconds), we add a
non-trivial amount of time walking over the given objects in order to
find better deltas.
In general, this is likely to more closely match the user's expectations
(i.e. that packs generated via `git multi-pack-index repack` are written
with high-quality deltas). But if not, we can always introduce a new
option in pack-objects to disable the supplemental object walk, which
would yield a pure CPU-time savings, at the cost of the on-disk size of
the resulting pack.
[^1]: In a patched version of Git that doesn't perform the supplemental
object walk in `pack-objects --stdin-packs`, we save around ~650ms
(from 5.968 to 5.325 seconds) when running `git multi-pack-index
repack --batch-size=0` on git.git with all objects packed, and all
packs in a MIDX.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In both fill_included_packs_all() and fill_included_packs_batch(), we
accumulate a list of packs whose contents we want to repack together,
and then use that information to feed a list of objects as input to
pack-objects.
In both cases, the `fill_included_packs_` functions keep track of how
many packs they want to repack together, and only execute pack-objects
if there are at least two packs that need repacking.
Having both of these functions keep track of this information themselves
is not strictly necessary, since they also log which packs to repack via
the `include_pack` array, so we can simply count the non-zero entries in
that array after either function is done executing, reducing the overall
amount of code necessary.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When performing a 'git multi-pack-index repack', the MIDX machinery
tries to aggregate MIDX'd packs together either to (a) fill the given
`--batch-size` argument, or (b) combine all packs together.
In either case (using the `midx-write.c::fill_included_packs_batch()` or
`midx-write.c::fill_included_packs_all()` function, respectively), we
evaluate whether or not we want to repack each MIDX'd pack, according to
whether or it is loadable, kept, cruft, or non-empty.
Between the two `fill_included_packs_` callers, they both care about the
same conditions, except for `fill_included_packs_batch()` which also
cares that the pack is non-empty.
We could extract two functions (say, `want_included_pack()` and a
`_nonempty()` variant), but this is not necessary. For the case in
`fill_included_packs_all()` which does not check the pack size, we add
all of the pack's objects assuming that the pack meets all other
criteria. But if the pack is empty in the first place, we add all of its
zero objects, so whether or not we "accept" or "reject" it in the first
place is irrelevant.
This change improves the readability in both `fill_included_packs_`
functions.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Introduce a new midx-write.c source file, which holds all of the
functionality from the MIDX sub-system related to writing new MIDX files.
Similar to the relationship between "pack-bitmap.c" and
"pack-bitmap-write.c", this source file will hold code that is specific
to writing MIDX files as opposed to reading them (the latter will remain
in midx.c).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>