If you're going to access the contents of every object in a
packfile, it's generally much more efficient to do so in
pack order, rather than in hash order. That increases the
locality of access within the packfile, which in turn is
friendlier to the delta base cache, since the packfile puts
related deltas next to each other. By contrast, hash order
is effectively random, since the sha1 has no discernible
relationship to the content.
This patch introduces an "--unordered" option to cat-file
which iterates over packs in pack-order under the hood. You
can see the results when dumping all of the file content:
$ time ./git cat-file --batch-all-objects --buffer --batch | wc -c
6883195596
real 0m44.491s
user 0m42.902s
sys 0m5.230s
$ time ./git cat-file --unordered \
--batch-all-objects --buffer --batch | wc -c
6883195596
real 0m6.075s
user 0m4.774s
sys 0m3.548s
Same output, different order, way faster. The same speed-up
applies even if you end up accessing the object content in a
different process, like:
git cat-file --batch-all-objects --buffer --batch-check |
grep blob |
git cat-file --batch='%(objectname) %(rest)' |
wc -c
Adding "--unordered" to the first command drops the runtime
in git.git from 24s to 3.5s.
Side note: there are actually further speedups available
for doing it all in-process now. Since we are outputting
the object content during the actual pack iteration, we
know where to find the object and could skip the extra
lookup done by oid_object_info(). This patch stops short
of that optimization since the underlying API isn't ready
for us to make those sorts of direct requests.
So if --unordered is so much better, why not make it the
default? Two reasons:
1. We've promised in the documentation that --batch-all-objects
outputs in hash order. Since cat-file is plumbing,
people may be relying on that default, and we can't
change it.
2. It's actually _slower_ for some cases. We have to
compute the pack revindex to walk in pack order. And
our de-duplication step uses an oidset, rather than a
sort-and-dedup, which can end up being more expensive.
If we're just accessing the type and size of each
object, for example, like:
git cat-file --batch-all-objects --buffer --batch-check
my best-of-five warm cache timings go from 900ms to
1100ms using --unordered. Though it's possible in a
cold-cache or under memory pressure that we could do
better, since we'd have better locality within the
packfile.
And one final question: why is it "--unordered" and not
"--pack-order"? The answer is again two-fold:
1. "pack order" isn't a well-defined thing across the
whole set of objects. We're hitting loose objects, as
well as objects in multiple packs, and the only
ordering we're promising is _within_ a single pack. The
rest is apparently random.
2. The point here is optimization. So we don't want to
promise any particular ordering, but only to say that
we will choose an ordering which is likely to be
efficient for accessing the object content. That leaves
the door open for further changes in the future without
having to add another compatibility option.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We're not really doing the batch-show operation in these
callbacks, but just collecting the set of objects. That
distinction will become more important in a future patch, so
let's rename them now to avoid cluttering that diff.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The recursive merge strategy did not properly ensure there was no
change between HEAD and the index before performing its operation,
which has been corrected.
* en/dirty-merge-fixes:
merge: fix misleading pre-merge check documentation
merge-recursive: enforce rule that index matches head before merging
t6044: add more testcases with staged changes before a merge is invoked
merge-recursive: fix assumption that head tree being merged is HEAD
merge-recursive: make sure when we say we abort that we actually abort
t6044: add a testcase for index matching head, when head doesn't match HEAD
t6044: verify that merges expected to abort actually abort
index_has_changes(): avoid assuming operating on the_index
read-cache.c: move index_has_changes() from merge.c
"git rebase --rebase-merges" mode now handles octopus merges as
well.
* js/rebase-merge-octopus:
rebase --rebase-merges: adjust man page for octopus support
rebase --rebase-merges: add support for octopus merges
merge: allow reading the merge commit message from a file
"git gc --auto" opens file descriptors for the packfiles before
spawning "git repack/prune", which would upset Windows that does
not want a process to work on a file that is open by another
process. The issue has been worked around.
* kg/gc-auto-windows-workaround:
gc --auto: release pack files before auto packing
For a large tree, the index needs to hold many cache entries
allocated on heap. These cache entries are now allocated out of a
dedicated memory pool to amortize malloc(3) overhead.
* jm/cache-entry-from-mem-pool:
block alloc: add validations around cache_entry lifecyle
block alloc: allocate cache entries from mem_pool
mem-pool: fill out functionality
mem-pool: add life cycle management functions
mem-pool: only search head block for available space
block alloc: add lifecycle APIs for cache_entry structs
read-cache: teach make_cache_entry to take object_id
read-cache: teach refresh_cache_entry to take istate
"git fetch" learned a new option "--negotiation-tip" to limit the
set of commits it tells the other end as "have", to reduce wasted
bandwidth and cycles, which would be helpful when the receiving
repository has a lot of refs that have little to do with the
history at the remote it is fetching from.
* jt/fetch-nego-tip:
fetch-pack: support negotiation tip whitelist
Parsing of -L[<N>][,[<M>]] parameters "git blame" and "git log"
take has been tweaked.
* is/parsing-line-range:
log: prevent error if line range ends past end of file
blame: prevent error if range ends past end of file
"git checkout" and "git worktree add" learned to honor
checkout.defaultRemote when auto-vivifying a local branch out of a
remote tracking branch in a repository with multiple remotes that
have tracking branches that share the same names.
* ab/checkout-default-remote:
checkout & worktree: introduce checkout.defaultRemote
checkout: add advice for ambiguous "checkout <branch>"
builtin/checkout.c: use "ret" variable for return
checkout: pass the "num_matches" up to callers
checkout.c: change "unique" member to "num_matches"
checkout.c: introduce an *_INIT macro
checkout.h: wrap the arguments to unique_tracking_name()
checkout tests: index should be clean after dwim checkout
"git fsck" learns to make sure the optional commit-graph file is in
a sane state.
* ds/commit-graph-fsck: (23 commits)
coccinelle: update commit.cocci
commit-graph: update design document
gc: automatically write commit-graph files
commit-graph: add '--reachable' option
commit-graph: use string-list API for input
fsck: verify commit-graph
commit-graph: verify contents match checksum
commit-graph: test for corrupted octopus edge
commit-graph: verify commit date
commit-graph: verify generation number
commit-graph: verify parent list
commit-graph: verify root tree OIDs
commit-graph: verify objects exist
commit-graph: verify corrupt OID fanout and lookup
commit-graph: verify required chunks are present
commit-graph: verify catches corrupt signature
commit-graph: add 'verify' subcommand
commit-graph: load a root tree from specific graph
commit: force commit to parse from object database
commit-graph: parse commit from chosen graph
...
Conversion from uchar[40] to struct object_id continues.
* bc/object-id:
pretty: switch hard-coded constants to the_hash_algo
sha1-file: convert constants to uses of the_hash_algo
log-tree: switch GIT_SHA1_HEXSZ to the_hash_algo->hexsz
diff: switch GIT_SHA1_HEXSZ to use the_hash_algo
builtin/merge-recursive: make hash independent
builtin/merge: switch to use the_hash_algo
builtin/fmt-merge-msg: make hash independent
builtin/update-index: simplify parsing of cacheinfo
builtin/update-index: convert to using the_hash_algo
refs/files-backend: use the_hash_algo for writing refs
sha1-name: use the_hash_algo when parsing object names
strbuf: allocate space with GIT_MAX_HEXSZ
commit: express tree entry constants in terms of the_hash_algo
hex: switch to using the_hash_algo
tree-walk: replace hard-coded constants with the_hash_algo
cache: update object ID functions for the_hash_algo
Code clean-up.
* hs/push-cert-check-cleanup:
gpg-interface: make parse_gpg_output static and remove from interface header
builtin/receive-pack: use check_signature from gpg-interface
Partial clone support of "git clone" has been updated to correctly
validate the objects it receives from the other side. The server
side has been corrected to send objects that are directly
requested, even if they may match the filtering criteria (e.g. when
doing a "lazy blob" partial clone).
* jt/partial-clone-fsck-connectivity:
clone: check connectivity even if clone is partial
upload-pack: send refs' objects despite "filter"
"git fetch" failed to correctly validate the set of objects it
received when making a shallow history deeper, which has been
corrected.
* jt/connectivity-check-after-unshallow:
fetch-pack: write shallow, then check connectivity
fetch-pack: implement ref-in-want
fetch-pack: put shallow info in output parameter
fetch: refactor to make function args narrower
fetch: refactor fetch_refs into two functions
fetch: refactor the population of peer ref OIDs
upload-pack: test negotiation with changing repository
upload-pack: implement ref-in-want
test-pkt-line: add unpack-sideband subcommand
Tighten the API to make it harder to misuse in-tree .gitmodules
file, even though it shares the same syntax with configuration
files, to read random configuration items from it.
* ao/config-from-gitmodules:
submodule-config: reuse config_from_gitmodules in repo_read_gitmodules
submodule-config: pass repository as argument to config_from_gitmodules
submodule-config: make 'config_from_gitmodules' private
submodule-config: add helper to get 'update-clone' config from .gitmodules
submodule-config: add helper function to get 'fetch' config from .gitmodules
config: move config_from_gitmodules to submodule-config.c
The "-l" option in "git branch -l" is an unfortunate short-hand for
"--create-reflog", but many users, both old and new, somehow expect
it to be something else, perhaps "--list". This step warns when "-l"
is used as a short-hand for "--create-reflog" and warns about the
future repurposing of the it when it is used.
* jk/branch-l-0-deprecation:
branch: deprecate "-l" option
t: switch "branch -l" to "branch --create-reflog"
t3200: unset core.logallrefupdates when testing reflog creation
"git grep" learned the "--column" option that gives not just the
line number but the column number of the hit.
* tb/grep-column:
contrib/git-jump/git-jump: jump to exact location
grep.c: add configuration variables to show matched option
builtin/grep.c: add '--column' option to 'git-grep(1)'
grep.c: display column number of first match
grep.[ch]: extend grep_opt to allow showing matched column
grep.c: expose {,inverted} match column in match_line()
Documentation/config.txt: camel-case lineNumber for consistency
The effort to move globals to per-repository in-core structure
continues.
* jt/remove-pack-bitmap-global:
pack-bitmap: add free function
pack-bitmap: remove bitmap_git global variable
Recently added "--base" option to "git format-patch" command did
not correctly generate prereq patch ids.
* xy/format-patch-prereq-patch-id-fix:
format-patch: clear UNINTERESTING flag before prepare_bases
"git submodule" did not correctly adjust core.worktree setting that
indicates whether/where a submodule repository has its associated
working tree across various state transitions, which has been
corrected.
* sb/submodule-core-worktree:
submodule deinit: unset core.worktree
submodule: ensure core.worktree is set after update
submodule: unset core.worktree if no working tree is present
The conversion to pass "the_repository" and then "a_repository"
throughout the object access API continues.
* sb/object-store-grafts:
commit: allow lookup_commit_graft to handle arbitrary repositories
commit: allow prepare_commit_graft to handle arbitrary repositories
shallow: migrate shallow information into the object parser
path.c: migrate global git_path_* to take a repository argument
cache: convert get_graft_file to handle arbitrary repositories
commit: convert read_graft_file to handle arbitrary repositories
commit: convert register_commit_graft to handle arbitrary repositories
commit: convert commit_graft_pos() to handle arbitrary repositories
shallow: add repository argument to is_repository_shallow
shallow: add repository argument to check_shallow_file_for_update
shallow: add repository argument to register_shallow
shallow: add repository argument to set_alternate_shallow_file
commit: add repository argument to lookup_commit_graft
commit: add repository argument to prepare_commit_graft
commit: add repository argument to read_graft_file
commit: add repository argument to register_commit_graft
commit: add repository argument to commit_graft_pos
object: move grafts to object parser
object-store: move object access functions to object-store.h
Add a struct repository argument to the functions in commit-graph.h that
read the commit graph. (This commit does not affect functions that write
commit graphs.)
Because the commit graph functions can now read the commit graph of any
repository, the global variable core_commit_graph has been removed.
Instead, the config option core.commitGraph is now read on the first
time in a repository that a commit is attempted to be parsed using its
commit graph.
This commit includes a test that exercises the functionality on an
arbitrary repository that is not the_repository.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Use GIT_MAX_HEXSZ instead of GIT_SHA1_HEXSZ for an allocation so that it
is sufficiently large. Switch a comparison to use the_hash_algo to
determine the length of a hex object ID.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert several uses of GIT_SHA1_HEXSZ into references to the_hash_algo.
Switch other uses into a use of parse_oid_hex and uses of its computed
pointer.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Switch from using get_oid_hex to parse_oid_hex to simplify pointer
operations and avoid the need for a hash-related constant.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Switch from using GIT_SHA1_HEXSZ to the_hash_algo to make the parsing of
the index information hash independent.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our color buffers are all COLOR_MAXLEN, which fits the
largest possible color. So we can never overflow the buffer
by copying an existing color. However, using strcpy() makes
it harder to audit the code-base for calls that _are_
problems. We should use something like xsnprintf(), which
shows the reader that we expect this never to fail (and
provides a run-time assertion if it does, just in case).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is consistent with `git commit` which, like `git merge`, supports
passing the commit message via `-m <msg>` and, unlike `git merge` before
this patch, via `-F <file>`.
It is useful to allow this for scripted use, or for the upcoming patch
to allow (re-)creating octopus merges in `git rebase --rebase-merges`.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The combination of verify_signed_buffer followed by parse_gpg_output is
available as check_signature. Use that instead of implementing it again.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
`git merge-recursive` does a three-way merge between user-specified trees
base, head, and remote. Since the user is allowed to specify head, we can
not necesarily assume that head == HEAD.
Modify index_has_changes() to take an extra argument specifying the tree
to compare against. If NULL, it will compare to HEAD. We then use this
from merge-recursive to make sure we compare to the user-specified head.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach gc --auto to release pack files before auto packing the repository
to prevent failures when removing them.
Also teach the test 'fetching with auto-gc does not lock up' to complain
when it is no longer triggering an auto packing of the repository.
Fixes https://github.com/git-for-windows/git/issues/500
Signed-off-by: Kim Gybels <kgybels@infogroep.be>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach 'git grep --only-matching', a new option to only print the
matching part(s) of a line.
For instance, a line containing the following (taken from README.md:27):
(`man gitcvs-migration` or `git help cvs-migration` if git is
Is printed as follows:
$ git grep --line-number --column --only-matching -e git -- \
README.md | grep ":27"
README.md:27:7:git
README.md:27:16:git
README.md:27:38:git
The patch works mostly as one would expect, with the exception of a few
considerations that are worth mentioning here.
Like GNU grep, this patch ignores --only-matching when --invert (-v) is
given. There is a sensible answer here, but parity with the behavior of
other tools is preferred.
Because a line might contain more than one match, there are special
considerations pertaining to when to print line headers, newlines, and
how to increment the match column offset. The line header and newlines
are handled as a special case within the main loop to avoid polluting
the surrounding code with conditionals that have large blocks.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The commit that introduced the partial clone feature - 548719fbdc
("clone: partial clone", 2017-12-08) - excluded connectivity checks
for partial clones, but this also meant that it is possible for a clone
to succeed, yet not have all objects either present or promised.
Specifically, if cloning with --filter=blob:none from a repository that
has a tag pointing to a blob, and the blob is not sent in the packfile,
the clone will pass, even if the blob is not referenced by any tree in
the packfile.
Turn on connectivity checks for partial clone.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As reported here[0], Microsoft Visual Studio 2017.2 and "gcc -pedantic"
don't understand the forward declaration of an unsized static array.
They insist on an array size:
d:\git\src\builtin\config.c(70,46): error C2133: 'builtin_config_options': unknown size
The thread [1] explains that this is due to the single-pass nature of
old compilers.
To work around this error, introduce the forward-declared function
usage_builtin_config() instead that uses the array
builtin_config_options only after it has been defined.
Also use this function in all other places where usage_with_options() is
called with the same arguments.
[0]: https://github.com/git-for-windows/git/issues/1735
[1]: https://groups.google.com/forum/#!topic/comp.lang.c.moderated/bmiF2xMz51U
Fixes https://github.com/git-for-windows/git/issues/1735
Reported-By: Karen Huang (via GitHub)
Signed-off-by: Beat Bolli <dev+git@drbeat.li>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
During negotiation, fetch-pack eventually reports as "have" lines all
commits reachable from all refs. Allow the user to restrict the commits
sent in this way by providing a whitelist of tips; only the tips
themselves and their ancestors will be sent.
Both globs and single objects are supported.
This feature is only supported for protocols that support connect or
stateless-connect (such as HTTP with protocol v2).
This will speed up negotiation when the repository has multiple
relatively independent branches (for example, when a repository
interacts with multiple repositories, such as with linux-next [1] and
torvalds/linux [2]), and the user knows which local branch is likely to
have commits in common with the upstream branch they are fetching.
[1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next/
[2] https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When fetching, connectivity is checked after the shallow file is
updated. There are 2 issues with this: (1) the connectivity check is
only performed up to ancestors of existing refs (which is not thorough
enough if we were deepening an existing ref in the first place), and (2)
there is no rollback of the shallow file if the connectivity check
fails.
To solve (1), update the connectivity check to check the ancestry chain
completely in the case of a deepening fetch by refraining from passing
"--not --all" when invoking rev-list in connected.c.
To solve (2), have fetch_pack() perform its own connectivity check
before updating the shallow file. To support existing use cases in which
"git fetch-pack" is used to download objects without much regard as to
the connectivity of the resulting objects with respect to the existing
repository, the connectivity check is only done if necessary (that is,
the fetch is not a clone, and the fetch involves shallow/deepen
functionality). "git fetch" still performs its own connectivity check,
preserving correctness but sometimes performing redundant work. This
redundancy is mitigated by the fact that fetch_pack() reports if it has
performed a connectivity check itself, and if the transport supports
connect or stateless-connect, it will bubble up that report so that "git
fetch" knows not to perform the connectivity check in such a case.
This was noticed when a user tried to deepen an existing repository by
fetching with --no-shallow from a server that did not send all necessary
objects - the connectivity check as run by "git fetch" succeeded, but a
subsequent "git fsck" failed.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Modify index_has_changes() to take a struct istate* instead of just
operating on the_index. This is only a partial conversion, though,
because we call do_diff_cache() which implicitly assumes work is to be
done on the_index. Ongoing work is being done elsewhere to do the
remainder of the conversion, and thus is not duplicated here. Instead,
a simple check is put in place until that work is complete.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It has been observed that the time spent loading an index with a large
number of entries is partly dominated by malloc() calls. This change
is in preparation for using memory pools to reduce the number of
malloc() calls made to allocate cahce entries when loading an index.
Add an API to allocate and discard cache entries, abstracting the
details of managing the memory backing the cache entries. This commit
does actually change how memory is managed - this will be done in a
later commit in the series.
This change makes the distinction between cache entries that are
associated with an index and cache entries that are not associated with
an index. A main use of cache entries is with an index, and we can
optimize the memory management around this. We still have other cases
where a cache entry is not persisted with an index, and so we need to
handle the "transient" use case as well.
To keep the congnitive overhead of managing the cache entries, there
will only be a single discard function. This means there must be enough
information kept with the cache entry so that we know how to discard
them.
A summary of the main functions in the API is:
make_cache_entry: create cache entry for use in an index. Uses specified
parameters to populate cache_entry fields.
make_empty_cache_entry: Create an empty cache entry for use in an index.
Returns cache entry with empty fields.
make_transient_cache_entry: create cache entry that is not used in an
index. Uses specified parameters to populate
cache_entry fields.
make_empty_transient_cache_entry: create cache entry that is not used in
an index. Returns cache entry with
empty fields.
discard_cache_entry: A single function that knows how to discard a cache
entry regardless of how it was allocated.
Signed-off-by: Jameson Miller <jamill@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a repository argument to allow the callers of deref_tag
to be more specific about which repository to act on. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.
As with the previous commits, use a macro to catch callers passing a
repository other than the_repository at compile time.
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a repository argument to allow the callers of parse_tag_buffer
to be more specific about which repository to act on. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.
As with the previous commits, use a macro to catch callers passing a
repository other than the_repository at compile time.
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a repository argument to allow the callers of lookup_tag
to be more specific about which repository to act on. This is a small
mechanical change; it doesn't change the implementation to handle
repositories other than the_repository yet.
As with the previous commits, use a macro to catch callers passing a
repository other than the_repository at compile time.
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>