A double-free that may not trigger in practice by luck has been
corrected in the reference resolution code.
* sj/refs-symref-referent-fix:
ref-cache: fix invalid free operation in `free_ref_entry`
"git fsck" learned to issue warnings on "curiously formatted" ref
contents that have always been taken valid but something Git
wouldn't have written itself (e.g., missing terminating end-of-line
after the full object name).
* sj/ref-contents-check:
ref: add symlink ref content check for files backend
ref: check whether the target of the symref is a ref
ref: add basic symref content check for files backend
ref: add more strict checks for regular refs
ref: port git-fsck(1) regular refs check for files backend
ref: support multiple worktrees check for refs
ref: initialize ref name outside of check functions
ref: check the full refname instead of basename
ref: initialize "fsck_ref_report" with zero
In cfd971520e (refs: keep track of unresolved reference value in
iterators, 2024-08-09), we added a new field "referent" into the "struct
ref" structure. In order to free the "referent", we unconditionally
freed the "referent" by simply adding a "free" statement.
However, this is a bad usage. Because when ref entry is either directory
or loose ref, we will always execute the following statement:
free(entry->u.value.referent);
This does not make sense. We should never access the "entry->u.value"
field when "entry" is a directory. However, the change obviously doesn't
break the tests. Let's analysis why.
The anonymous union in the "ref_entry" has two members: one is "struct
ref_value", another is "struct ref_dir". On a 64-bit machine, the size
of "struct ref_dir" is 32 bytes, which is smaller than the 48-byte size
of "struct ref_value". And the offset of "referent" field in "struct
ref_value" is 40 bytes. So, whenever we create a new "ref_entry" for a
directory, we will leave the offset from 40 bytes to 48 bytes untouched,
which means the value for this memory is zero (NULL). It's OK to free a
NULL pointer, but this is merely a coincidence of memory layout.
To fix this issue, we now ensure that "free(entry->u.value.referent)" is
only called when "entry->flag" indicates that it represents a loose
reference and not a directory to avoid the invalid memory operation.
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Besides the textual symref, we also allow symbolic links as the symref.
So, we should also provide the consistency check as what we have done
for textual symref. And also we consider deprecating writing the
symbolic links. We first need to access whether symbolic links still
be used. So, add a new fsck message "symlinkRef(INFO)" to tell the
user be aware of this information.
We have already introduced "files_fsck_symref_target". We should reuse
this function to handle the symrefs which use legacy symbolic links. We
should not check the trailing garbage for symbolic refs. Add a new
parameter "symbolic_link" to disable some checks which should only be
executed for textual symrefs.
And we need to also generate the "referent" parameter for reusing
"files_fsck_symref_target" by the following steps:
1. Use "strbuf_add_real_path" to resolve the symlink and get the
absolute path "ref_content" which the symlink ref points to.
2. Generate the absolute path "abs_gitdir" of "gitdir" and combine
"ref_content" and "abs_gitdir" to extract the relative path
"relative_referent_path".
3. If "ref_content" is outside of "gitdir", we just set "referent" with
"ref_content". Instead, we set "referent" with
"relative_referent_path".
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ideally, we want to the users use "git symbolic-ref" to create symrefs
instead of writing raw contents into the filesystem. However, "git
symbolic-ref" is strict with the refname but not strict with the
referent. For example, we can make the "referent" located at the
"$(gitdir)/logs/aaa" and manually write the content into this where we
can still successfully parse this symref by using "git rev-parse".
$ git init repo && cd repo && git commit --allow-empty -mx
$ git symbolic-ref refs/heads/test logs/aaa
$ echo $(git rev-parse HEAD) > .git/logs/aaa
$ git rev-parse test
We may need to add some restrictions for "referent" parameter when using
"git symbolic-ref" to create symrefs because ideally all the
nonpseudo-refs should be located under the "refs" directory and we may
tighten this in the future.
In order to tell the user we may tighten the above situation, create
a new fsck message "symrefTargetIsNotARef" to notify the user that this
may become an error in the future.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We have code that checks regular ref contents, but we do not yet check
the contents of symbolic refs. By using "parse_loose_ref_content" for
symbolic refs, we will get the information of the "referent".
We do not need to check the "referent" by opening the file. This is
because if "referent" exists in the file system, we will eventually
check its correctness by inspecting every file in the "refs" directory.
If the "referent" does not exist in the filesystem, this is OK as it is
seen as the dangling symref.
So we just need to check the "referent" string content. A regular ref
could be accepted as a textual symref if it begins with "ref:", followed
by zero or more whitespaces, followed by the full refname, followed only
by whitespace characters. However, we always write a single SP after
"ref:" and a single LF after the refname. It may seem that we should
report a fsck error message when the "referent" does not apply above
rules and we should not be so aggressive because third-party
reimplementations of Git may have taken advantage of the looser syntax.
Put it more specific, we accept the following contents:
1. "ref: refs/heads/master "
2. "ref: refs/heads/master \n \n"
3. "ref: refs/heads/master\n\n"
When introducing the regular ref content checks, we created two fsck
infos "refMissingNewline" and "trailingRefContent" which exactly
represents above situations. So we will reuse these two fsck messages to
write checks to info the user about these situations.
But we do not allow any other trailing garbage. The followings are bad
symref contents which will be reported as fsck error by "git-fsck(1)".
1. "ref: refs/heads/master garbage\n"
2. "ref: refs/heads/master \n\n\n garbage "
And we introduce a new "badReferentName(ERROR)" fsck message to report
above errors by using "is_root_ref" and "check_refname_format" to check
the "referent". Since both "is_root_ref" and "check_refname_format"
don't work with whitespaces, we use the trimmed version of "referent"
with these functions.
In order to add checks, we will do the following things:
1. Record the untrimmed length "orig_len" and untrimmed last byte
"orig_last_byte".
2. Use "strbuf_rtrim" to trim the whitespaces or newlines to make sure
"is_root_ref" and "check_refname_format" won't be failed by them.
3. Use "orig_len" and "orig_last_byte" to check whether the "referent"
misses '\n' at the end or it has trailing whitespaces or newlines.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We have already used "parse_loose_ref_contents" function to check
whether the ref content is valid in files backend. However, by
using "parse_loose_ref_contents", we allow the ref's content to end with
garbage or without a newline.
Even though we never create such loose refs ourselves, we have accepted
such loose refs. So, it is entirely possible that some third-party tools
may rely on such loose refs being valid. We should not report an error
fsck message at current. We should notify the users about such
"curiously formatted" loose refs so that adequate care is taken before
we decide to tighten the rules in the future.
And it's not suitable either to report a warn fsck message to the user.
We don't yet want the "--strict" flag that controls this bit to end up
generating errors for such weirdly-formatted reference contents, as we
first want to assess whether this retroactive tightening will cause
issues for any tools out there. It may cause compatibility issues which
may break the repository. So, we add the following two fsck infos to
represent the situation where the ref content ends without newline or
has trailing garbages:
1. refMissingNewline(INFO): A loose ref that does not end with
newline(LF).
2. trailingRefContent(INFO): A loose ref has trailing content.
It might appear that we can't provide the user with any warnings by
using FSCK_INFO. However, in "fsck.c::fsck_vreport", we will convert
FSCK_INFO to FSCK_WARN and we can still warn the user about these
situations when using "git refs verify" without introducing
compatibility issues.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git-fsck(1)" implicitly checks the ref content by passing the
callback "fsck_handle_ref" to the "refs.c::refs_for_each_rawref".
Then, it will check whether the ref content (eventually "oid")
is valid. If not, it will report the following error to the user.
error: refs/heads/main: invalid sha1 pointer 0000...
And it will also report above errors when there are dangling symrefs
in the repository wrongly. This does not align with the behavior of
the "git symbolic-ref" command which allows users to create dangling
symrefs.
As we have already introduced the "git refs verify" command, we'd better
check the ref content explicitly in the "git refs verify" command thus
later we could remove these checks in "git-fsck(1)" and launch a
subprocess to call "git refs verify" in "git-fsck(1)" to make the
"git-fsck(1)" more clean.
Following what "git-fsck(1)" does, add a similar check to "git refs
verify". Then add a new fsck error message "badRefContent(ERROR)" to
represent that a ref has an invalid content.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We have already set up the infrastructure to check the consistency for
refs, but we do not support multiple worktrees. However, "git-fsck(1)"
will check the refs of worktrees. As we decide to get feature parity
with "git-fsck(1)", we need to set up support for multiple worktrees.
Because each worktree has its own specific refs, instead of just showing
the users "refs/worktree/foo", we need to display the full name such as
"worktrees/<id>/refs/worktree/foo". So we should know the id of the
worktree to get the full name. Add a new parameter "struct worktree *"
for "refs-internal.h::fsck_fn". Then change the related functions to
follow this new interface.
The "packed-refs" only exists in the main worktree, so we should only
check "packed-refs" in the main worktree. Use "is_main_worktree" method
to skip checking "packed-refs" in "packed_fsck" function.
Then, enhance the "files-backend.c::files_fsck_refs_dir" function to add
"worktree/<id>/" prefix when we are not in the main worktree.
Last, add a new test to check the refname when there are multiple
worktrees to exercise the code.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We passes "refs_check_dir" to the "files_fsck_refs_name" function which
allows it to create the checked ref name later. However, when we
introduce a new check function, we have to allocate redundant memory and
re-calculate the ref name. It's bad for us to allocate redundant memory
and duplicate logic. Instead, we should allocate and calculate it only
once and pass the ref name to the check functions.
In order not to do repeat calculation, rename "refs_check_dir" to
"refname". And in "files_fsck_refs_dir", create a new strbuf "refname",
thus whenever we handle a new ref, calculate the name and call the check
functions one by one.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In "files-backend.c::files_fsck_refs_name", we validate the refname
format by using "check_refname_format" to check the basename of the
iterator with "REFNAME_ALLOW_ONELEVEL" flag.
However, this is a bad implementation. Although we doesn't allow a
single "@" in ".git" directory, we do allow "refs/heads/@". So, we will
report an error wrongly when there is a "refs/heads/@" ref by using one
level refname "@".
Because we just check one level refname, we either cannot check the
other parts of the full refname. And we will ignore the following
errors:
"refs/heads/ new-feature/test"
"refs/heads/~new-feature/test"
In order to fix the above problem, enhance "files_fsck_refs_name" to use
the full name for "check_refname_format". Then, replace the tests which
are related to "@" and add tests to exercise the above situations using
for loop to avoid repetition.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In "fsck.c::fsck_refs_error_function", we need to tell whether "oid" and
"referent" is NULL. So, we need to always initialize these parameters to
NULL instead of letting them point to anywhere when creating a new
"fsck_ref_report" structure.
The original code explicitly initializes the "path" member in the
"struct fsck_ref_report" to NULL (which implicitly 0-initializes other
members in the struct). It is more customary to use "{ 0 }" to express
that we are 0-initializing everything. In order to align with the
codebase, initialize "fsck_ref_report" with zero.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reference transactions use `refs_verify_refname_available()` to check
for colliding references. This check consists of two parts:
- Checks for whether multiple ref updates in the same transaction
conflict with each other.
- Checks for whether existing refs conflict with any refs part of the
transaction.
While we generally cannot avoid the first check, the second check is
superfluous in cases where the transaction is an initial one in an
otherwise empty ref store. The check results in multiple ref reads as
well as the creation of a ref iterator for every ref we're checking,
which adds up quite fast when performing the check for many refs.
Introduce a new flag that allows us to skip this check and wire it up in
such that the backends pass it when running an initial transaction. This
leads to significant speedups when migrating ref storage backends. From
"files" to "reftable":
Benchmark 1: migrate files:reftable (refcount = 100000, revision = HEAD~)
Time (mean ± σ): 472.4 ms ± 6.7 ms [User: 175.9 ms, System: 285.2 ms]
Range (min … max): 463.5 ms … 483.2 ms 10 runs
Benchmark 2: migrate files:reftable (refcount = 100000, revision = HEAD)
Time (mean ± σ): 86.1 ms ± 1.9 ms [User: 67.9 ms, System: 16.0 ms]
Range (min … max): 82.9 ms … 90.9 ms 29 runs
Summary
migrate files:reftable (refcount = 100000, revision = HEAD) ran
5.48 ± 0.15 times faster than migrate files:reftable (refcount = 100000, revision = HEAD~)
And from "reftable" to "files":
Benchmark 1: migrate reftable:files (refcount = 100000, revision = HEAD~)
Time (mean ± σ): 452.7 ms ± 3.4 ms [User: 209.9 ms, System: 235.4 ms]
Range (min … max): 445.9 ms … 457.5 ms 10 runs
Benchmark 2: migrate reftable:files (refcount = 100000, revision = HEAD)
Time (mean ± σ): 95.2 ms ± 2.2 ms [User: 73.6 ms, System: 20.6 ms]
Range (min … max): 91.7 ms … 100.8 ms 28 runs
Summary
migrate reftable:files (refcount = 100000, revision = HEAD) ran
4.76 ± 0.11 times faster than migrate reftable:files (refcount = 100000, revision = HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "files" backend has implemented special logic when committing
the first transactions in an otherwise empty ref store: instead of
writing all refs as separate loose files, it instead knows to write them
all into a "packed-refs" file directly. This is significantly more
efficient than having to write each of the refs as separate "loose" ref.
The only user of this optimization is git-clone(1), which only uses this
mechanism to write regular refs. Consequently, the implementation does
not know how to handle both symbolic and root refs. While fine in the
context of git-clone(1), this keeps us from using the mechanism in more
cases.
Adapt the logic to also support symbolic and root refs by using a second
transaction that we use for all of the refs that need to be written as
loose refs.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are two different ways to commit a transaction:
- `ref_transaction_commit()` can be used to commit a regular
transaction and is what almost every caller wants.
- `initial_ref_transaction_commit()` can be used when it is known that
the ref store that the transaction is committed for is empty and
when there are no concurrent processes. This is used when cloning a
new repository.
Implementing this via two separate functions has a couple of downsides.
First, every reference backend needs to implement a separate callback
even in the case where they don't special-case the initial transaction.
Second, backends are basically forced to reimplement the whole logic for
how to commit the transaction like the "files" backend does, even though
backends may wish to only tweak certain behaviour of a "normal" commit.
Third, it is awkward that callers must never prepare the transaction as
this is somewhat different than how a transaction typically works.
Refactor the code such that we instead mark initial transactions via a
separate flag when starting the transaction. This addresses all of the
mentioned painpoints, where the most important part is that it will
allow backends to have way more leeway in how exactly they want to
handle the initial transaction.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move the logic to commit initial transactions such that we can start to
call it in `files_transaction_finish()` in a subsequent commit without
requiring a separate function declaration.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow passing flags when setting up a transaction such that the
behaviour of the transaction itself can be altered. This functionality
will be used in a subsequent patch.
Adapt callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Used regex to find these typos:
(?<!struct )(?<=\s)([a-z]{1,}) \1(?=\s)
Signed-off-by: Sven Strickroth <email@cs-ware.de>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
The reftable library is now prepared to expect that the memory
allocation function given to it may fail to allocate and to deal
with such an error.
* ps/reftable-alloc-failures: (26 commits)
reftable/basics: fix segfault when growing `names` array fails
reftable/basics: ban standard allocator functions
reftable: introduce `REFTABLE_FREE_AND_NULL()`
reftable: fix calls to free(3P)
reftable: handle trivial allocation failures
reftable/tree: handle allocation failures
reftable/pq: handle allocation failures when adding entries
reftable/block: handle allocation failures
reftable/blocksource: handle allocation failures
reftable/iter: handle allocation failures when creating indexed table iter
reftable/stack: handle allocation failures in auto compaction
reftable/stack: handle allocation failures in `stack_compact_range()`
reftable/stack: handle allocation failures in `reftable_new_stack()`
reftable/stack: handle allocation failures on reload
reftable/reader: handle allocation failures in `reader_init_iter()`
reftable/reader: handle allocation failures for unindexed reader
reftable/merged: handle allocation failures in `merged_table_init_iter()`
reftable/writer: handle allocation failures in `reftable_new_writer()`
reftable/writer: handle allocation failures in `writer_index_hash()`
reftable/record: handle allocation failures when decoding records
...
Handle allocation failures in `merged_table_init_iter()`. While at it,
merge `merged_iter_init()` into the function. It only has a single
caller and merging them makes it easier to handle allocation failures
consistently.
This change also requires us to adapt `reftable_stack_init_*_iterator()`
to bubble up the new error codes of `merged_table_iter_init()`. Adapt
callsites accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Give timeout to the locking code to write to reftable.
* ps/reftable-concurrent-writes:
refs/reftable: reload locked stack when preparing transaction
reftable/stack: allow locking of outdated stacks
refs/reftable: introduce "reftable.lockTimeout"
The reftable backend learned to more efficiently handle exclude
patterns while enumerating the refs.
* ps/reftable-exclude:
refs/reftable: wire up support for exclude patterns
reftable/reader: make table iterator reseekable
t/unit-tests: introduce reftable library
Makefile: stop listing test library objects twice
builtin/receive-pack: fix exclude patterns when announcing refs
refs: properly apply exclude patterns to namespaced refs
When starting a reftable transaction we lock all stacks we are about to
modify. While it may happen that the stack is out-of-date at this point
in time we don't really care: transactional updates encode the expected
state of a certain reference, so all that we really want to verify is
that the _current_ value matches that expected state.
Pass `REFTABLE_STACK_NEW_ADDITION_RELOAD` when locking the stack such
that an out-of-date stack will be reloaded after having been locked.
This change is safe because all verifications of the expected state
happen after this step anyway.
Add a testcase that verifies that many writers are now able to write to
the stack concurrently without failures and with a deterministic end
result.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In `reftable_stack_new_addition()` we first lock the stack and then
check whether it is still up-to-date. If it is not we return an error to
the caller indicating that the stack is outdated.
This is overly restrictive in our ref transaction interface though: we
lock the stack right before we start to verify the transaction, so we do
not really care whether it is outdated or not. What we really want is
that the stack is up-to-date after it has been locked so that we can
verify queued updates against its current state while we know that it is
locked for concurrent modification.
Introduce a new flag `REFTABLE_STACK_NEW_ADDITION_RELOAD` that alters
the behaviour of `reftable_stack_init_addition()` in this case: when we
notice that it is out-of-date we reload it instead of returning an error
to the caller.
This logic will be wired up in the reftable backend in the next commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When multiple concurrent processes try to update references in a
repository they may try to lock the same lockfiles. This can happen even
when the updates are non-conflicting and can both be applied, so it
doesn't always make sense to abort the transaction immediately. Both the
"loose" and "packed" backends thus have a grace period that they wait
for the lock to be released that can be controlled via the config values
"core.filesRefLockTimeout" and "core.packedRefsTimeout", respectively.
The reftable backend doesn't have such a setting yet and instead fails
immediately when it sees such a lock. But the exact same concepts apply
here as they do apply to the other backends.
Introduce a new "reftable.lockTimeout" config that controls how long we
may wait for a "tables.list" lock to be released. The default value of
this config is 100ms, which is the same default as we have it for the
"loose" backend.
Note that even though we also lock individual tables, this config really
only applies to the "tables.list" file. This is because individual
tables are only ever locked when we already hold the "tables.list" lock
during compaction. When we observe such a lock we in fact do not want to
compact the table at all because it is already in the process of being
compacted by a concurrent process. So applying the same timeout here
would not make any sense and only delay progress.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up.
* ps/environ-wo-the-repository: (21 commits)
environment: stop storing "core.notesRef" globally
environment: stop storing "core.warnAmbiguousRefs" globally
environment: stop storing "core.preferSymlinkRefs" globally
environment: stop storing "core.logAllRefUpdates" globally
refs: stop modifying global `log_all_ref_updates` variable
branch: stop modifying `log_all_ref_updates` variable
repo-settings: track defaults close to `struct repo_settings`
repo-settings: split out declarations into a standalone header
environment: guard state depending on a repository
environment: reorder header to split out `the_repository`-free section
environment: move `set_git_dir()` and related into setup layer
environment: make `get_git_namespace()` self-contained
environment: move object database functions into object layer
config: make dependency on repo in `read_early_config()` explicit
config: document `read_early_config()` and `read_very_early_config()`
environment: make `get_git_work_tree()` accept a repository
environment: make `get_graft_file()` accept a repository
environment: make `get_index_file()` accept a repository
environment: make `get_object_directory()` accept a repository
environment: make `get_git_common_dir()` accept a repository
...
Exclude patterns can be used by reference backends to skip over blocks
of references that are uninteresting to the caller. Reference backends
do not have to wire up support for them, and all callers are expected to
behave as if the backend didn't support them. In fact, the only backend
that supports exclude patterns right now is the "packed" backend.
Exclude patterns can be quite an important performance optimization in
repositories that have loads of references. The patterns are set up in
case "transfer.hideRefs" and friends are configured during a fetch, so
handling these patterns becomes important once there are lots of hidden
refs in a served repository.
Now that we have properly re-seekable reftable iterators we can also
wire up support for these patterns in the "reftable" backend. Doing so
is conceptually simple: once we hit a reference whose prefix matches the
current exclude pattern we re-seek the iterator to the first reference
that doesn't match the pattern anymore. This schema only works for
trivial patterns that do not have any globbing characters in them, but
this restriction also applies do the "packed" backend.
This makes t1419 work with the "reftable" backend with some slight
modifications. Of course it also speeds up listing of references with
hidden refs. The following benchmark prints one reference with 1 million
hidden references:
Benchmark 1: HEAD~
Time (mean ± σ): 93.3 ms ± 2.1 ms [User: 90.3 ms, System: 2.5 ms]
Range (min … max): 89.8 ms … 97.2 ms 33 runs
Benchmark 2: HEAD
Time (mean ± σ): 4.2 ms ± 0.6 ms [User: 2.2 ms, System: 1.8 ms]
Range (min … max): 3.1 ms … 8.1 ms 765 runs
Summary
HEAD ran
22.15 ± 3.19 times faster than HEAD~
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git pack-refs --auto" for the files backend was too aggressive,
which has been a bit tamed.
* ps/pack-refs-auto-heuristics:
refs/files: use heuristic to decide whether to repack with `--auto`
t0601: merge tests for auto-packing of refs
wrapper: introduce `log2u()`
Same as the preceding commit, storing the "core.preferSymlinkRefs" value
globally is misdesigned as this setting may be set per repository.
There is only a single user of this value anyway, namely the "files"
backend. So let's just remove the global variable and read the value of
this setting when initializing the backend.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The value of "core.logAllRefUpdates" is being stored in the global
variable `log_all_ref_updates`. This design is somewhat aged nowadays,
where it is entirely possible to access multiple repositories in the
same process which all have different values for this setting. So using
a single global variable to track it is plain wrong.
Remove the global variable. Instead, we now provide a new function part
of the repo-settings subsystem that parses the value for a specific
repository. While that may require us to read the value multiple times,
we work around this by reading it once when the ref backends are set up
and caching the value there.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In refs-related code we modify the global `log_all_ref_updates`
variable, which is done because `should_autocreate_reflog()` does not
accept passing an `enum log_refs_config` but instead accesses the global
variable. Adapt its interface such that the value is provided by the
caller, which allows us to compute the proper value locally without
having to modify global state.
This change requires us to move the enum to "repo-settings.h", or
otherwise we get compilation errors due to include cycles. We're about
to fully move this setting into the repo-settings subsystem anyway, so
this is fine.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In "environment.h" we have quite a lot of functions and variables that
either explicitly or implicitly depend on `the_repository`.
The implicit set of stateful declarations includes for example variables
which get populated when parsing a repository's Git configuration. This
set of variables is broken by design, as their state often depends on
the last repository config that has been parsed. So they may or may not
represent the state of `the_repository`.
Fixing that is quite a big undertaking, and later patches in this series
will demonstrate a solution for a first small set of those variables. So
for now, let's guard these with `USE_THE_REPOSITORY_VARIABLE` so that
callers are aware of the implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `--auto` flag for git-pack-refs(1) allows the ref backend to decide
whether or not a repack is in order. This switch has been introduced
mostly with the "reftable" backend in mind, which already knows to
auto-compact its tables during normal operations. When the flag is set,
then it will use the same auto-compaction mechanism and thus end up
doing nothing in most cases.
The "files" backend does not have any such heuristic yet and instead
packs any loose references unconditionally. So we rewrite the complete
"packed-refs" file even if there's only a single loose reference to be
packed.
Even worse, starting with 9f6714ab3e (builtin/gc: pack refs when using
`git maintenance run --auto`, 2024-03-25), `git pack-refs --auto` is
unconditionally executed via our auto maintenance, so we end up repacking
references every single time auto maintenance kicks in. And while that
commit already mentioned that the "files" backend unconditionally packs
refs now, the author obviously didn't quite think about the consequences
thereof. So while the idea was sound, we really should have added a
heuristic to the "files" backend before implementing it.
Introduce a heuristic that decides whether or not it is worth to pack
loose references. The important factors to decide here are the number of
loose references in comparison to the overall size of the "packed-refs"
file. The bigger the "packed-refs" file, the longer it takes to rewrite
it and thus we scale up the limit of allowed loose references before we
repack.
As is the nature of heuristics, this mechansim isn't obviously
"correct", but should rather be seen as a tradeoff between how much
resources we spend packing refs and how inefficient the ref store
becomes. For all I can say, we have successfully been using the exact
same heuristic in Gitaly for several years by now.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is needed to build things with -Werror=unused-parameter on a
platform without symbolic link support.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Mark unused parameters as UNUSED to squelch -Wunused warnings.
* jk/mark-unused-parameters:
t-hashmap: stop calling setup() for t_intern() test
scalar: mark unused parameters in dummy function
daemon: mark unused parameters in non-posix fallbacks
setup: mark unused parameter in config callback
test-mergesort: mark unused parameters in trivial callback
t-hashmap: mark unused parameters in callback function
reftable: mark unused parameters in virtual functions
reftable: drop obsolete test function declarations
reftable: ignore unused argc/argv in test functions
unit-tests: ignore unused argc/argv
t/helper: mark more unused argv/argc arguments
oss-fuzz: mark unused argv/argc argument
refs: mark unused parameters in do_for_each_reflog_helper()
refs: mark unused parameters in ref_store fsck callbacks
update-ref: mark more unused parameters in parser callbacks
imap-send: mark unused parameter in ssl_socket_connect() fallback
Drop unused parameters from functions.
* jk/drop-unused-parameters:
diff-lib: drop unused index argument from get_stat_data()
ref-filter: drop unused parameters from email_atom_option_parser()
pack-bitmap: drop unused parameters from select_pseudo_merges()
pack-bitmap: load writer config from repository parameter
refs: drop some unused parameters from create_symref_lock()
Use of API functions that implicitly depend on the_repository
object in the config subsystem has been rewritten to pass a
repository object through the callchain.
* ps/config-wo-the-repository:
config: hide functions using `the_repository` by default
global: prepare for hiding away repo-less config functions
config: don't depend on `the_repository` with branch conditions
config: don't have setters depend on `the_repository`
config: pass repo to functions that rename or copy sections
config: pass repo to `git_die_config()`
config: pass repo to `git_config_get_expiry_in_days()`
config: pass repo to `git_config_get_expiry()`
config: pass repo to `git_config_get_max_percent_split_change()`
config: pass repo to `git_config_get_split_index()`
config: pass repo to `git_config_get_index_threads()`
config: expose `repo_config_clear()`
config: introduce missing setters that take repo as parameter
path: hide functions using `the_repository` by default
path: stop relying on `the_repository` in `worktree_git_path()`
path: stop relying on `the_repository` when reporting garbage
hooks: remove implicit dependency on `the_repository`
editor: do not rely on `the_repository` for interactive edits
path: expose `do_git_common_path()` as `repo_common_pathv()`
path: expose `do_git_path()` as `repo_git_pathv()`
The reftable code uses a lot of virtual function pointers, but many of
the concrete implementations do not need all of the parameters.
For the most part these are obviously fine to just mark as UNUSED (e.g.,
the empty_iterator functions unsurprisingly do not do anything). Here
are a few cases where I dug a little deeper (but still ended up just
marking them UNUSED):
- the iterator exclude_patterns is best-effort and optional (though it
would be nice to support in the long run as an optimization)
- ignoring the ref_store in many transaction functions is unexpected,
but works because the ref_transaction itself carries enough
information to do what we need.
- ignoring "err" for in some cases (e.g., transaction abort) is OK
because we do not return any errors. It is a little odd for
reftable_be_create_reflog(), though, since we do return errors
there. We should perhaps be creating string error messages at this
layer, but I've punted on that for now.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit ab6f79d8df (refs: set up ref consistency check infrastructure,
2024-08-08) added virtual functions to the ref store for doing fsck
checks. But the packed and reftable backends do not yet do anything.
Let's annotate them to silence -Wunused-parameter.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This function was factored out in 57d0b1e2ea (files-backend: extract out
`create_symref_lock()`, 2024-05-07), but we never look at the ref_store
or refname parameters. We just need the path, which is already contained
in the lockfile struct.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git fsck" infrastructure has been taught to also check the sanity
of the ref database, in addition to the object database.
* sj/ref-fsck:
fsck: add ref name check for files backend
files-backend: add unified interface for refs scanning
builtin/refs: add verify subcommand
refs: set up ref consistency check infrastructure
fsck: add refs report function
fsck: add a unified interface for reporting fsck messages
fsck: make "fsck_error" callback generic
fsck: rename objects-related fsck error functions
fsck: rename "skiplist" to "skip_oids"
The refs API has been taught to give symref target information to
the users of ref iterators, allowing for-each-ref and friends to
avoid an extra ref_resolve_* API call per a symbolic ref.
* jc/refs-symref-referent:
ref-filter: populate symref from iterator
refs: add referent to each_ref_fn
refs: keep track of unresolved reference value in iterators
We're about to hide config functions that implicitly depend on
`the_repository` behind the `USE_THE_REPOSITORY_VARIABLE` macro. This
will uncover a bunch of dependents that transitively relied on the
global variable, but didn't define the macro yet.
Adapt them such that we define the macro to prepare for this change.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Add a parameter to each_ref_fn so that callers to the ref APIs
that use this function as a callback can have acess to the
unresolved value of a symbolic ref.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Since ref iterators do not hold onto the direct value of a reference
without resolving it, the only way to get ahold of a direct value of a
symbolic ref is to make a separate call to refs_read_symbolic_ref.
To make accessing the direct value of a symbolic ref more efficient,
let's save the direct value of the ref in the iterators for both the
files backend and the reftable backend.
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The git-fsck(1) only implicitly checks the reference, it does not fully
check refs with bad format name such as standalone "@".
However, a file ending with ".lock" should not be marked as having a bad
ref name. It is expected that concurrent writers may have such lock files.
We currently ignore this situation. But for bare ".lock" file, we will
report it as error.
In order to provide such checks, add a new fsck message id "badRefName"
with default ERROR type. Use existing "check_refname_format" to explicit
check the ref name. And add a new unit test to verify the functionality.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
For refs and reflogs, we need to scan its corresponding directories to
check every regular file or symbolic link which shares the same pattern.
Introduce a unified interface for scanning directories for
files-backend.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "struct ref_store" is the base class which contains the "be" pointer
which provides backend-specific functions whose interfaces are defined
in the "ref_storage_be". We could reuse this polymorphism to define only
one interface. For every backend, we need to provide its own function
pointer.
The interfaces defined in the `ref_storage_be` are carefully structured
in semantic. It's organized as the five parts:
1. The name and the initialization interfaces.
2. The ref transaction interfaces.
3. The ref internal interfaces (pack, rename and copy).
4. The ref filesystem interfaces.
5. The reflog related interfaces.
To keep consistent with the git-fsck(1), add a new interface named
"fsck_refs_fn" to the end of "ref_storage_be". This semantic cannot be
grouped into any above five categories. Explicitly add blank line to
make it different from others.
Last, implement placeholder functions for each ref backends.
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>