Commit Graph

215 Commits

Author SHA1 Message Date
9118e46e81 Merge branch 'cp/unit-test-reftable-record'
A test in reftable library has been rewritten using the unit test
framework.

* cp/unit-test-reftable-record:
  t-reftable-record: add tests for reftable_log_record_compare_key()
  t-reftable-record: add tests for reftable_ref_record_compare_name()
  t-reftable-record: add index tests for reftable_record_is_deletion()
  t-reftable-record: add obj tests for reftable_record_is_deletion()
  t-reftable-record: add log tests for reftable_record_is_deletion()
  t-reftable-record: add ref tests for reftable_record_is_deletion()
  t-reftable-record: add comparison tests for obj records
  t-reftable-record: add comparison tests for index records
  t-reftable-record: add comparison tests for ref records
  t-reftable-record: add reftable_record_cmp() tests for log records
  t: move reftable/record_test.c to the unit testing framework
2024-07-15 10:11:44 -07:00
7b472da915 Merge branch 'ps/use-the-repository'
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help
transition the codebase to rely less on the availability of the
singleton the_repository instance.

* ps/use-the-repository:
  hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`
  t/helper: remove dependency on `the_repository` in "proc-receive"
  t/helper: fix segfault in "oid-array" command without repository
  t/helper: use correct object hash in partial-clone helper
  compat/fsmonitor: fix socket path in networked SHA256 repos
  replace-object: use hash algorithm from passed-in repository
  protocol-caps: use hash algorithm from passed-in repository
  oidset: pass hash algorithm when parsing file
  http-fetch: don't crash when parsing packfile without a repo
  hash-ll: merge with "hash.h"
  refs: avoid include cycle with "repository.h"
  global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
  hash: require hash algorithm in `empty_tree_oid_hex()`
  hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
  hash: make `is_null_oid()` independent of `the_repository`
  hash: convert `oidcmp()` and `oideq()` to compare whole hash
  global: ensure that object IDs are always padded
  hash: require hash algorithm in `oidread()` and `oidclr()`
  hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
  hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
2024-07-02 09:59:00 -07:00
ba9661b457 t: move reftable/record_test.c to the unit testing framework
reftable/record_test.c exercises the functions defined in
reftable/record.{c, h}. Migrate reftable/record_test.c to the
unit testing framework. Migration involves refactoring the tests
to use the unit testing framework instead of reftable's test
framework, and renaming the tests to fit unit-tests' naming scheme.

While at it, change the type of index variable 'i' to 'size_t'
from 'int'. This is because 'i' is used in comparison against
'ARRAY_SIZE(x)' which is of type 'size_t'.

Also, use set_hash() which is defined locally in the test file
instead of set_test_hash() which is defined by
reftable/test_framework.{c, h}. This is fine to do as both these
functions are similarly implemented, and
reftable/test_framework.{c, h} is not #included in the ported test.

Get rid of reftable_record_print() from the tests as well, because
it clutters the test framework's output and we have no way of
verifying the output.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Chandra Pratap <chandrapratap3519@gmail.com>
Acked-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-07-02 08:12:24 -07:00
4216329457 Merge branch 'ps/no-writable-strings'
Building with "-Werror -Wwrite-strings" is now supported.

* ps/no-writable-strings: (27 commits)
  config.mak.dev: enable `-Wwrite-strings` warning
  builtin/merge: always store allocated strings in `pull_twohead`
  builtin/rebase: always store allocated string in `options.strategy`
  builtin/rebase: do not assign default backend to non-constant field
  imap-send: fix leaking memory in `imap_server_conf`
  imap-send: drop global `imap_server_conf` variable
  mailmap: always store allocated strings in mailmap blob
  revision: always store allocated strings in output encoding
  remote-curl: avoid assigning string constant to non-const variable
  send-pack: always allocate receive status
  parse-options: cast long name for OPTION_ALIAS
  http: do not assign string constant to non-const field
  compat/win32: fix const-correctness with string constants
  pretty: add casts for decoration option pointers
  object-file: make `buf` parameter of `index_mem()` a constant
  object-file: mark cached object buffers as const
  ident: add casts for fallback name and GECOS
  entry: refactor how we remove items for delayed checkouts
  line-log: always allocate the output prefix
  line-log: stop assigning string constant to file parent buffer
  ...
2024-06-17 15:55:58 -07:00
40a163f217 Merge branch 'ps/ref-storage-migration'
A new command has been added to migrate a repository that uses the
files backend for its ref storage to use the reftable backend, with
limitations.

* ps/ref-storage-migration:
  builtin/refs: new command to migrate ref storage formats
  refs: implement logic to migrate between ref storage formats
  refs: implement removal of ref storages
  worktree: don't store main worktree twice
  reftable: inline `merged_table_release()`
  refs/files: fix NULL pointer deref when releasing ref store
  refs/files: extract function to iterate through root refs
  refs/files: refactor `add_pseudoref_and_head_entries()`
  refs: allow to skip creation of reflog entries
  refs: pass storage format to `ref_store_init()` explicitly
  refs: convert ref storage format to an enum
  setup: unset ref storage when reinitializing repository version
2024-06-17 15:55:55 -07:00
8a676bdc5c hash-ll: merge with "hash.h"
The "hash-ll.h" header was introduced via d1cbe1e6d8 (hash-ll.h: split
out of hash.h to remove dependency on repository.h, 2023-04-22) to make
explicit the split between hash-related functions that rely on the
global `the_repository`, and those that don't. This split is no longer
necessary now that we we have removed the reliance on `the_repository`.

Merge "hash-ll.h" back into "hash.h". This causes some code units to not
include "repository.h" anymore, which requires us to add some forward
declarations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 10:26:33 -07:00
092b33da2b Merge branch 'ps/ref-storage-migration' into ps/use-the-repository
* ps/ref-storage-migration:
  builtin/refs: new command to migrate ref storage formats
  refs: implement logic to migrate between ref storage formats
  refs: implement removal of ref storages
  worktree: don't store main worktree twice
  reftable: inline `merged_table_release()`
  refs/files: fix NULL pointer deref when releasing ref store
  refs/files: extract function to iterate through root refs
  refs/files: refactor `add_pseudoref_and_head_entries()`
  refs: allow to skip creation of reflog entries
  refs: pass storage format to `ref_store_init()` explicitly
  refs: convert ref storage format to an enum
  setup: unset ref storage when reinitializing repository version
2024-06-13 09:39:08 -07:00
56346ba24e Merge branch 'cp/reftable-unit-test'
Basic unit tests for reftable have been reimplemented under the
unit test framework.

* cp/reftable-unit-test:
  t: improve the test-case for parse_names()
  t: add test for put_be16()
  t: move tests from reftable/record_test.c to the new unit test
  t: move tests from reftable/stack_test.c to the new unit test
  t: move reftable/basics_test.c to the unit testing framework
2024-06-12 13:37:14 -07:00
66f892bb07 reftable: cast away constness when assigning constants to records
The reftable records are used in multiple ways throughout the reftable
library. In many of those cases they merely act as input to a function
without getting modified by it at all. Most importantly, this happens
when writing records and when querying for records.

We rely on this in our tests and thus assign string constants to those
fields, which is about to generate warnings as those fields are of type
`char *`. While we could go through the process and instead allocate
those strings in all of our tests, this feels quite unnecessary.

Instead, add casts to `char *` for all of those strings. As this is part
of our tests, this also nicely serves as a demonstration that nothing
writes or frees those string constants, which would otherwise lead to
segfaults.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07 10:30:49 -07:00
b567004b4b global: improve const correctness when assigning string constants
We're about to enable `-Wwrite-strings`, which changes the type of
string constants to `const char[]`. Fix various sites where we assign
such constants to non-const variables.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-07 10:30:48 -07:00
b5d7db9e83 reftable: inline merged_table_release()
The function `merged_table_release()` releases a merged table, whereas
`reftable_merged_table_free()` releases a merged table and then also
free's its pointer. But all callsites of `merged_table_release()` are in
fact followed by `reftable_merged_table_free()`, which is redundant.

Inline `merged_table_release()` into `reftable_merged_table_free()` to
get rid of this redundance.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-06 09:04:32 -07:00
67ce50ba26 Merge branch 'ps/reftable-reusable-iterator'
Code clean-up to make the reftable iterator closer to be reusable.

* ps/reftable-reusable-iterator:
  reftable/merged: adapt interface to allow reuse of iterators
  reftable/stack: provide convenience functions to create iterators
  reftable/reader: adapt interface to allow reuse of iterators
  reftable/generic: adapt interface to allow reuse of iterators
  reftable/generic: move seeking of records into the iterator
  reftable/merged: simplify indices for subiterators
  reftable/merged: split up initialization and seeking of records
  reftable/reader: set up the reader when initializing table iterator
  reftable/reader: inline `reader_seek_internal()`
  reftable/reader: separate concerns of table iter and reftable reader
  reftable/reader: unify indexed and linear seeking
  reftable/reader: avoid copying index iterator
  reftable/block: use `size_t` to track restart point index
2024-05-30 14:15:12 -07:00
afe5b9e7ec t: move tests from reftable/record_test.c to the new unit test
common_prefix_size(), get_be24() and put_be24() are functions defined
in reftable/basics.{c, h}. Move the tests for these functions from
reftable/record_test.c to the newly ported test.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Chandra Pratap <chandrapratap3519@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-30 07:30:10 -07:00
f74e1865fe t: move tests from reftable/stack_test.c to the new unit test
parse_names() and names_equal() are functions defined in
reftable/basics.{c, h}. Move the tests for these functions from
reftable/stack_test.c to the newly ported test.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Chandra Pratap <chandrapratap3519@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-30 07:30:10 -07:00
b34116a30c t: move reftable/basics_test.c to the unit testing framework
reftable/basics_test.c exercise the functions defined in
reftable/basics.{c, h}. Migrate reftable/basics_test.c to the
unit testing framework. Migration involves refactoring the tests
to use the unit testing framework instead of reftable's test
framework.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Chandra Pratap <chandrapratap3519@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-30 07:30:10 -07:00
369b84196e reftable/merged: adapt interface to allow reuse of iterators
Refactor the interfaces exposed by `struct reftable_merged_table` and
`struct merged_iter` such that they support iterator reuse. This is done
by separating initialization of the iterator and seeking on it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:19 -07:00
08efe69212 reftable/stack: provide convenience functions to create iterators
There exist a bunch of call sites in the reftable backend that want to
create iterators for a reftable stack. This is rather convoluted right
now, where you always have to go via the merged table. And it is about
to become even more convoluted when we split up iterator initialization
and seeking in the next commit.

Introduce convenience functions that allow the caller to create an
iterator from a reftable stack directly without going through the merged
table. Adapt callers accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:19 -07:00
0e7be2b3ea reftable/reader: adapt interface to allow reuse of iterators
Refactor the interfaces exposed by `struct reftable_reader` and `struct
table_iterator` such that they support iterator reuse. This is done by
separating initialization of the iterator and seeking on it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:18 -07:00
d76f0d3f57 reftable/generic: adapt interface to allow reuse of iterators
Refactor the interfaces exposed by `struct reftable_table` and `struct
reftable_iterator` such that they support iterator reuse. This is done
by separating initialization of the iterator and seeking on it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:18 -07:00
5bf96e0c39 reftable/generic: move seeking of records into the iterator
Reftable iterators are created by seeking on the parent structure of a
corresponding record. For example, to create an iterator for the merged
table you would call `reftable_merged_table_seek_ref()`. Most notably,
it is not posible to create an iterator and then seek it afterwards.

While this may be a bit easier to reason about, it comes with two
significant downsides. The first downside is that the logic to find
records is split up between the parent data structure and the iterator
itself. Conceptually, it is more straight forward if all that logic was
contained in a single place, which should be the iterator.

The second and more significant downside is that it is impossible to
reuse iterators for multiple seeks. Whenever you want to look up a
record, you need to re-create the whole infrastructure again, which is
quite a waste of time. Furthermore, it is impossible to optimize seeks,
such as when seeking the same record multiple times.

To address this, we essentially split up the concerns properly such that
the parent data structure is responsible for setting up the iterator via
a new `init_iter()` callback, whereas the iterator handles seeks via a
new `seek()` callback. This will eventually allow us to call `seek()` on
the iterator multiple times, where every iterator can potentially
optimize for certain cases.

Note that at this point in time we are not yet ready to reuse the
iterators. This will be left for a future patch series.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:18 -07:00
701713a254 reftable/merged: simplify indices for subiterators
When seeking on a merged table, we perform the seek for each of the
subiterators. If the subiterator has the desired record we add it to the
priority queue, otherwise we skip it and don't add it to the stack of
subiterators hosted by the merged table.

The consequence of this is that the index of the subiterator in the
merged table does not necessarily correspond to the index of it in the
merged iterator. Next to being potentially confusing, it also means that
we won't easily be able to re-seek the merged iterator because we have
no clear connection between both of the data structures.

Refactor the code so that the index stays the same in both structures.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:18 -07:00
e08f49a4f5 reftable/merged: split up initialization and seeking of records
To initialize a `struct merged_iter`, we need to seek all subiterators
to the wanted record and then add their results to the priority queue
used to sort the records. This logic is split up across two functions,
`merged_table_seek_record()` and `merged_iter_init()`. The scope of
these functions is somewhat weird though, where `merged_iter_init()` is
only responsible for adding the records of the subiterators to the
priority queue.

Clarify the scope of those functions such that `merged_iter_init()` is
only responsible for initializing the iterator's structure. Performing
the subiterator seeks are now part of `merged_table_seek_record()`.

This step is required to move seeking of records into the generic
`struct reftable_iterator` infrastructure.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
c82692f755 reftable/reader: set up the reader when initializing table iterator
All the seeking functions accept a `struct reftable_reader` as input
such that they can use the reader to look up the respective blocks.
Refactor the code to instead set up the reader as a member of `struct
table_iter` during initialization such that we don't have to pass the
reader on every single call.

This step is required to move seeking of records into the generic
`struct reftable_iterator` infrastructure.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
f1e3c12196 reftable/reader: inline reader_seek_internal()
We have both `reader_seek()` and `reader_seek_internal()`, where the
former function only exists so that we can exit early in case the given
table has no records of the sought-after type.

Merge these two functions into one.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
81a03a3236 reftable/reader: separate concerns of table iter and reftable reader
In "reftable/reader.c" we implement two different interfaces:

  - The reftable reader contains the logic to read reftables.

  - The table iterator is used to iterate through a single reftable read
    by the reader.

The way those two types are used in the code is somewhat confusing
though because seeking inside a table is implemented as if it was part
of the reftable reader, even though it is ultimately more of a detail
implemented by the table iterator.

Make the boundary between those two types clearer by renaming functions
that seek records in a table such that they clearly belong to the table
iterator's logic.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
dfdd1455bb reftable/reader: unify indexed and linear seeking
In `reader_seek_internal()` we either end up doing an indexed seek when
there is one or a linear seek otherwise. These two code paths are
disjunct without a good reason, where the indexed seek will cause us to
exit early.

Refactor the two code paths such that it becomes possible to share a bit
more code between them.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:16 -07:00
9a59b65dba reftable/reader: avoid copying index iterator
When doing an indexed seek we need to walk down the multi-level index
until we finally hit a record of the desired indexed type. This loop
performs a copy of the index iterator on every iteration, which is both
hard to understand and completely unnecessary.

Refactor the code so that we use a single iterator to walk down the
indices, only.

Note that while this should improve performance, the improvement is
negligible in all but the most unreasonable repositories. This is
because the effect is only really noticeable when we have to walk down
many levels of indices, which is not something that a repository would
typically have. So the motivation for this change is really only about
readability.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:16 -07:00
d537ce6b9e reftable/block: use size_t to track restart point index
The function `block_reader_restart_offset()` gets the offset of the
`i`th restart point. `i` is a signed integer though, which is certainly
not the correct type to track indices like this. Furthermore, both
callers end up passing a `size_t`.

Refactor the code to use a `size_t` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:16 -07:00
f663d34306 reftable: make the compaction factor configurable
When auto-compacting, the reftable library packs references such that
the sizes of the tables form a geometric sequence. The factor for this
geometric sequence is hardcoded to 2 right now. We're about to expose
this as a config option though, so let's expose the factor via write
options.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:39 -07:00
8e9e136d61 reftable: use uint16_t to track restart interval
The restart interval can at most be `UINT16_MAX` as specified in the
technical documentation of the reftable format. Furthermore, it cannot
ever be negative. Regardless of that we use an `int` to track the
restart interval.

Change the type to use an `uint16_t` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
fcf341890e reftable/dump: support dumping a table's block structure
We're about to introduce new configs that will allow users to have more
control over how exactly reftables are written. To verify that these
configs are effective we will need to take a peak into the actual blocks
written by the reftable backend.

Introduce a new mode to the dumping logic that prints out the block
structure. This logic can be invoked via `test-tool dump-reftables -b`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
c22d75b027 reftable/writer: improve error when passed an invalid block size
The reftable format only supports block sizes up to 16MB. When the
writer is being passed a value bigger than that it simply calls
abort(3P), which isn't all that helpful due to the lack of a proper
error message.

Improve this by calling `BUG()` instead.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
e0cf3d8f8b reftable/writer: drop static variable used to initialize strbuf
We have a static variable in the reftable writer code that is merely
used to initialize the `last_key` of the writer. Convert the code to
instead use `strbuf_init()` and drop the variable.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
799237852b reftable: pass opts as constant pointer
We sometimes pass the refatble write options as value and sometimes as a
pointer. This is quite confusing and makes the reader wonder whether the
options get modified sometimes.

In fact, `reftable_new_writer()` does cause the caller-provided options
to get updated when some values aren't set up. This is quite unexpected,
but didn't cause any harm until now.

Adapt the code so that we do not modify the caller-provided values
anymore. While at it, refactor the code to code to consistently pass the
options as a constant pointer to clarify that the caller-provided opts
will not ever get modified.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
4d35bb2aba reftable: consistently refer to reftable_write_options as opts
Throughout the reftable library the `reftable_write_options` are
sometimes referred to as `cfg` and sometimes as `opts`. Unify these to
consistently use `opts` to avoid confusion.

While at it, touch up the coding style a bit by removing unneeded braces
around one-line statements and newlines between variable declarations.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:37 -07:00
5aec7231c8 Merge branch 'ps/reftable-write-optim'
Code to write out reftable has seen some optimization and
simplification.

* ps/reftable-write-optim:
  reftable/block: reuse compressed array
  reftable/block: reuse zstream when writing log blocks
  reftable/writer: reset `last_key` instead of releasing it
  reftable/writer: unify releasing memory
  reftable/writer: refactorings for `writer_flush_nonempty_block()`
  reftable/writer: refactorings for `writer_add_record()`
  refs/reftable: don't recompute committer ident
  reftable: remove name checks
  refs/reftable: skip duplicate name checks
  refs/reftable: perform explicit D/F check when writing symrefs
  refs/reftable: fix D/F conflict error message on ref copy
2024-05-08 10:18:43 -07:00
33bbc21c92 Merge branch 'ps/reftable-block-iteration-optim'
The code to iterate over reftable blocks has seen some optimization
to reduce memory allocation and deallocation.

* ps/reftable-block-iteration-optim:
  reftable/block: avoid copying block iterators on seek
  reftable/block: reuse `zstream` state on inflation
  reftable/block: open-code call to `uncompress2()`
  reftable/block: reuse uncompressed blocks
  reftable/reader: iterate to next block in place
  reftable/block: move ownership of block reader into `struct table_iter`
  reftable/block: introduce `block_reader_release()`
  reftable/block: better grouping of functions
  reftable/block: merge `block_iter_seek()` and `block_reader_seek()`
  reftable/block: rename `block_reader_start()`
2024-04-23 11:52:37 -07:00
82a31ec324 Merge branch 'jt/reftable-geometric-compaction'
The strategy to compact multiple tables of reftables after many
operations accumulate many entries has been improved to avoid
accumulating too many tables uncollected.

* jt/reftable-geometric-compaction:
  reftable/stack: use geometric table compaction
  reftable/stack: add env to disable autocompaction
  reftable/stack: expose option to disable auto-compaction
2024-04-16 14:50:30 -07:00
9da5c992dd reftable/block: avoid copying block iterators on seek
When seeking a reftable record in a block we need to position the
iterator _before_ the sought-after record so that the next call to
`block_iter_next()` would yield that record. To achieve this, the loop
that performs the linear seeks to restore the previous position once it
has found the record.

This is done by advancing two `block_iter`s: one to check whether the
next record is our sought-after record, and one that we update after
every iteration. This of course involves quite a lot of copying and also
leads to needless memory allocations.

Refactor the code to get rid of the `next` iterator and the copying this
involves. Instead, we can restore the previous offset such that the call
to `next` will return the correct record.

Next to being simpler conceptually this also leads to a nice speedup.
The following benchmark parser 10k refs out of 100k existing refs via
`git-rev-list --no-walk`:

  Benchmark 1: rev-list: print many refs (HEAD~)
    Time (mean ± σ):     170.2 ms ±   1.7 ms    [User: 86.1 ms, System: 83.6 ms]
    Range (min … max):   166.4 ms … 180.3 ms    500 runs

  Benchmark 2: rev-list: print many refs (HEAD~)
    Time (mean ± σ):     161.6 ms ±   1.6 ms    [User: 78.1 ms, System: 83.0 ms]
    Range (min … max):   158.4 ms … 172.3 ms    500 runs

  Summary
    rev-list: print many refs (HEAD) ran
      1.05 ± 0.01 times faster than rev-list: print many refs (HEAD~)

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:37:59 -07:00
ce1f213cc9 reftable/block: reuse zstream state on inflation
When calling `inflateInit()` and `inflate()`, the zlib library will
allocate several data structures for the underlying `zstream` to keep
track of various information. Thus, when inflating repeatedly, it is
possible to optimize memory allocation patterns by reusing the `zstream`
and then calling `inflateReset()` on it to prepare it for the next chunk
of data to inflate.

This is exactly what the reftable code is doing: when iterating through
reflogs we need to potentially inflate many log blocks, but we discard
the `zstream` every single time. Instead, as we reuse the `block_reader`
for each of the blocks anyway, we can initialize the `zstream` once and
then reuse it for subsequent inflations.

Refactor the code to do so, which leads to a significant reduction in
the number of allocations. The following measurements were done when
iterating through 1 million reflog entries. Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,552 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 302 allocs, 180 frees, 88,352 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
15a60b747e reftable/block: open-code call to uncompress2()
The reftable format stores log blocks in a compressed format. Thus,
whenever we want to read such a block we first need to decompress it.
This is done by calling the convenience function `uncompress2()` of the
zlib library, which is a simple wrapper that manages the lifecycle of
the `zstream` structure for us.

While nice for one-off inflation of data, when iterating through reflogs
we will likely end up inflating many such log blocks. This requires us
to reallocate the state of the `zstream` every single time, which adds
up over time. It would thus be great to reuse the `zstream` instead of
discarding it after every inflation.

Open-code the call to `uncompress2()` such that we can start reusing the
`zstream` in the subsequent commit. Note that our open-coded variant is
different from `uncompress2()` in two ways:

  - We do not loop around `inflate()` until we have processed all input.
    As our input is limited by the maximum block size, which is 16MB, we
    should not hit limits of `inflate()`.

  - We use `Z_FINISH` instead of `Z_NO_FLUSH`. Quoting the `inflate()`
    documentation: "inflate() should normally be called until it returns
    Z_STREAM_END or an error. However if all decompression is to be
    performed in a single step (a single call of inflate), the parameter
    flush should be set to Z_FINISH."

    Furthermore, "Z_FINISH also informs inflate to not maintain a
    sliding window if the stream completes, which reduces inflate's
    memory footprint."

Other than that this commit is expected to be functionally equivalent
and does not yet reuse the `zstream`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
dd347bbce6 reftable/block: reuse uncompressed blocks
The reftable backend stores reflog entries in a compressed format and
thus needs to uncompress blocks before one can read records from it.
For each reflog block we thus have to allocate an array that we can
decompress the block contents into. This block is being discarded
whenever the table iterator moves to the next block. Consequently, we
reallocate a new array on every block, which is quite wasteful.

Refactor the code to reuse the uncompressed block data when moving the
block reader to a new block. This significantly reduces the number of
allocations when iterating through many compressed blocks. The following
measurements are done with `git reflog list` when listing 100k reflogs.
Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 45,755 allocs, 45,633 frees, 254,779,456 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,547 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
b00bcb7c49 reftable/reader: iterate to next block in place
The table iterator has to iterate towards the next block once it has
yielded all records of the current block. This is done by creating a new
table iterator, initializing it to the next block, releasing the old
iterator and then copying over the data.

Refactor the code to instead advance the table iterator in place. This
is simpler and unlocks some optimizations in subsequent patches. Also,
it allows us to avoid some allocations.

The following measurements show a single matching ref out of 1 million
refs. Before this change:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 315 allocs, 190 frees, 107,027 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
bcdc586db0 reftable/block: move ownership of block reader into struct table_iter
The table iterator allows the caller to iterate through all records in a
reftable table. To do so it iterates through all blocks of the desired
type one by one, where for each block it creates a new block iterator
and yields all its entries.

One of the things that is somewhat confusing in this context is who owns
the block reader that is being used to read the blocks and pass them to
the block iterator. Intuitively, as the table iterator is responsible
for iterating through the blocks, one would assume that this iterator is
also responsible for managing the lifecycle of the reader. And while it
somewhat is, the block reader is ultimately stored inside of the block
iterator.

Refactor the code such that the block reader is instead fully managed by
the table iterator. Instead of passing the reader to the block iterator,
we now only end up passing the block data to it. Despite clearing up the
lifecycle of the reader, it will also allow for better reuse of the
reader in subsequent patches.

The following benchmark prints a single matching ref out of 1 million
refs. Before:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 6,607 allocs, 6,482 frees, 509,635 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

Note that while there are more allocation and free calls now, the
overall number of bytes allocated is significantly lower. The number of
allocations will be reduced significantly by the next patch though.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
b371221a60 reftable/block: introduce block_reader_release()
Introduce a new function `block_reader_release()` that releases
resources acquired by the block reader. This function will be extended
in a subsequent commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
aac8c03cc4 reftable/block: better grouping of functions
Function definitions and declaration of `struct block_reader` and
`struct block_iter` are somewhat mixed up, making it hard to see which
functions belong together. Rearrange them.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
42c7bdc36d reftable/block: merge block_iter_seek() and block_reader_seek()
The function `block_iter_seek()` is merely a simple wrapper around
`block_reader_seek()`. Merge those two functions into a new function
`block_iter_seek_key()` that more clearly says what it is actually
doing.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
3122d44025 reftable/block: rename block_reader_start()
The function `block_reader_start()` does not really apply to the block
reader, but to the block iterator. It's name is thus somewhat confusing.
Rename it to `block_iter_seek_start()` to clarify.

We will rename `block_reader_seek()` in similar spirit in the next
commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
7fbe3ead19 Merge branch 'ps/reftable-binsearch-updates'
Reftable code clean-up and some bugfixes.

* ps/reftable-binsearch-updates:
  reftable/block: avoid decoding keys when searching restart points
  reftable/record: extract function to decode key lengths
  reftable/block: fix error handling when searching restart points
  reftable/block: refactor binary search over restart points
  reftable/refname: refactor binary search over refnames
  reftable/basics: improve `binsearch()` test
  reftable/basics: fix return type of `binsearch()` to be `size_t`
2024-04-12 11:31:39 -07:00
eacfd581d2 Merge branch 'ps/pack-refs-auto'
"git pack-refs" learned the "--auto" option, which is a useful
addition to be triggered from "git gc --auto".

Acked-by: Karthik Nayak <karthik.188@gmail.com>
cf. <CAOLa=ZRAEA7rSUoYL0h-2qfEELdbPHbeGpgBJRqesyhHi9Q6WQ@mail.gmail.com>

* ps/pack-refs-auto:
  builtin/gc: pack refs when using `git maintenance run --auto`
  builtin/gc: forward git-gc(1)'s `--auto` flag when packing refs
  t6500: extract objects with "17" prefix
  builtin/gc: move `struct maintenance_run_opts`
  builtin/pack-refs: introduce new "--auto" flag
  builtin/pack-refs: release allocated memory
  refs/reftable: expose auto compaction via new flag
  refs: remove `PACK_REFS_ALL` flag
  refs: move `struct pack_refs_opts` to where it's used
  t/helper: drop pack-refs wrapper
  refs/reftable: print errors on compaction failure
  reftable/stack: gracefully handle failed auto-compaction due to locks
  reftable/stack: use error codes when locking fails during compaction
  reftable/error: discern locked/outdated errors
  reftable/stack: fix error handling in `reftable_stack_init_addition()`
2024-04-09 14:31:45 -07:00