Sync 'ds/multi-pack-index' to v2.19.0-rc0

* ds/multi-pack-index: (23 commits)
  midx: clear midx on repack
  packfile: skip loading index if in multi-pack-index
  midx: prevent duplicate packfile loads
  midx: use midx in approximate_object_count
  midx: use existing midx when writing new one
  midx: use midx in abbreviation calculations
  midx: read objects from multi-pack-index
  config: create core.multiPackIndex setting
  midx: write object offsets
  midx: write object id fanout chunk
  midx: write object ids in a chunk
  midx: sort and deduplicate objects from packfiles
  midx: read pack names into array
  multi-pack-index: write pack names in chunk
  multi-pack-index: read packfile list
  packfile: generalize pack directory list
  t5319: expand test data
  multi-pack-index: load into memory
  midx: write header information to lockfile
  multi-pack-index: add 'write' verb
  ...
Documentation/technical/multi-pack-index.txt
Normal file
@@ -0,0 +1,109 @@
Multi-Pack-Index (MIDX) Design Notes
====================================

The Git object directory contains a 'pack' directory containing
packfiles (with suffix ".pack") and pack-indexes (with suffix
".idx"). The pack-indexes provide a way to look up objects and
navigate to their offset within the pack, but these must come
in pairs with the packfiles. This pairing depends on the file
names, as the pack-index differs only in suffix with its
packfile. While the pack-indexes provide fast lookup per packfile,
this performance degrades as the number of packfiles increases,
because abbreviations need to inspect every packfile and we are
more likely to have a miss on our most-recently-used packfile.
For some large repositories, repacking into a single packfile
is not feasible due to storage space or excessive repack times.

The multi-pack-index (MIDX for short) stores a list of objects
and their offsets into multiple packfiles. It contains:

- A list of packfile names.
- A sorted list of object IDs.
- A list of metadata for the ith object ID including:
  - A value j referring to the jth packfile.
  - An offset within the jth packfile for the object.
- If large offsets are required, we use another list of large
  offsets similar to version 2 pack-indexes.

Thus, we can provide O(log N) lookup time, where N is the total
number of objects, for any number of packfiles.
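
As a rough illustration of the lookup described above, the following
Python sketch models the three lists in memory and resolves an object
ID with a single binary search. The names `pack_names`, `oids`, `meta`
and `midx_lookup`, and the sample data, are hypothetical and are not
taken from Git's implementation.

```python
from bisect import bisect_left

# Hypothetical in-memory model of a MIDX; names and data are illustrative only.
pack_names = ["pack-a.pack", "pack-b.pack"]
# Sorted object IDs (shortened hex strings for readability).
oids = ["1f3a", "7c20", "a9be", "e401"]
# meta[i] = (j, offset): object oids[i] lives in pack_names[j] at that offset.
meta = [(0, 12), (1, 345), (0, 678), (1, 12)]


def midx_lookup(oid):
    """Return (pack name, offset) for oid, or None if it is not indexed."""
    # One binary search over all object IDs, regardless of the pack count.
    i = bisect_left(oids, oid)
    if i < len(oids) and oids[i] == oid:
        j, offset = meta[i]
        return pack_names[j], offset
    return None
```

The cost depends only on the number of object IDs, not on how many
packfiles they are spread across.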

Design Details
--------------

- The MIDX is stored in a file named 'multi-pack-index' in the
  .git/objects/pack directory. This could be stored in the pack
  directory of an alternate. It refers only to packfiles in that
  same directory.

- The core.multiPackIndex config setting must be on to consume MIDX files.

- The file format includes parameters for the object ID hash
  function, so a future change of hash algorithm does not require
  a change in format.

- The MIDX keeps only one record per object ID. If an object appears
  in multiple packfiles, then the MIDX selects the copy in the
  most-recently modified packfile.

- If there exist packfiles in the pack directory not registered in
  the MIDX, then those packfiles are loaded into the `packed_git`
  list and `packed_git_mru` cache.

- The pack-indexes (.idx files) remain in the pack directory so we
  can delete the MIDX file, set core.multiPackIndex to false, or
  downgrade without any loss of information.

- The MIDX file format uses a chunk-based approach (similar to the
  commit-graph file) that allows optional data to be added.
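
The "one record per object ID" rule above can be sketched as follows.
This is a minimal Python illustration, assuming each pack is given as
a hypothetical `(pack_name, mtime, {oid: offset})` tuple; the function
name and input shape are invented for the example and do not reflect
how Git stores this data.

```python
def select_midx_entries(packs):
    """Pick one (oid, pack_name, offset) per object ID.

    `packs` is an iterable of (pack_name, mtime, {oid: offset}) tuples
    (a hypothetical shape). When an object appears in several packs,
    the copy in the most recently modified pack wins.
    """
    best = {}  # oid -> (mtime, pack_name, offset)
    for name, mtime, objects in packs:
        for oid, offset in objects.items():
            if oid not in best or mtime > best[oid][0]:
                best[oid] = (mtime, name, offset)
    # Emit entries sorted by object ID, as the MIDX stores them.
    return [(oid, name, offset) for oid, (_, name, offset) in sorted(best.items())]
```

For example, if "bb" is stored in both an older and a newer pack, only
the entry pointing into the newer pack is kept.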

Future Work
-----------

- Add a 'verify' subcommand to the 'git multi-pack-index' builtin to
  verify that the contents of the multi-pack-index file match the
  offsets listed in the corresponding pack-indexes.

- The multi-pack-index allows many packfiles, especially in a context
  where repacking is expensive (such as a very large repo), or where
  unexpected maintenance time is unacceptable (such as a high-demand
  build machine). However, the multi-pack-index needs to be rewritten
  in full every time. We can extend the format to be incremental, so
  writes are fast. By storing a small "tip" multi-pack-index that
  points to large "base" MIDX files, we can keep writes fast while
  still reducing the number of binary searches required for object
  lookups.

- The reachability bitmap is currently paired directly with a single
  packfile, using the pack-order as the object order to hopefully
  compress the bitmaps well using run-length encoding. This could be
  extended to pair a reachability bitmap with a multi-pack-index. If
  the multi-pack-index is extended to store a "stable object order"
  (a function Order(hash) = integer that is constant for a given hash,
  even as the multi-pack-index is updated) then a reachability bitmap
  could point to a multi-pack-index and be updated independently.

- Packfiles can be marked as "special" using empty files that share
  the initial name but replace ".pack" with ".keep" or ".promisor".
  We can add an optional chunk of data to the multi-pack-index that
  records flags of information about the packfiles. This allows new
  states, such as 'repacked' or 'redeltified', that can help with
  pack maintenance in a multi-pack environment. It may also be
  helpful to organize packfiles by object type (commit, tree, blob,
  etc.) and use this metadata to help that maintenance.

- The partial clone feature records special "promisor" packs that
  may point to objects that are not stored locally, but available
  on request to a server. The multi-pack-index does not currently
  track these promisor packs.

Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=6
    Chromium work item for: Multi-Pack Index (MIDX)

[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
    An earlier RFC for the multi-pack-index feature

[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
|
||||
@@ -252,3 +252,80 @@ Pack file entry: <+
corresponding packfile.

20-byte SHA-1-checksum of all of the above.

== multi-pack-index (MIDX) files have the following format:

The multi-pack-index files refer to multiple pack-files.

In order to allow extensions that add extra data to the MIDX, we organize
the body into "chunks" and provide a lookup table at the beginning of the
body. The header includes certain length values, such as the number of packs,
the number of base MIDX files, hash lengths and types.

All 4-byte numbers are in network order.

HEADER:

    4-byte signature:
        The signature is: {'M', 'I', 'D', 'X'}

    1-byte version number:
        Git only writes or recognizes version 1.

    1-byte Object Id Version
        Git only writes or recognizes version 1 (SHA1).

    1-byte number of "chunks"

    1-byte number of base multi-pack-index files:
        This value is currently always zero.

    4-byte number of pack files
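
The header layout above adds up to 12 bytes. A minimal Python sketch
of decoding it with the standard `struct` module might look like the
following; the function name and the dictionary keys are our own
labels for the fields, not identifiers from Git.

```python
import struct

# Field order follows the HEADER description; ">" selects network byte order.
MIDX_HEADER = struct.Struct(">4sBBBBI")  # 4 + 1 + 1 + 1 + 1 + 4 = 12 bytes


def parse_midx_header(data):
    """Decode the 12-byte MIDX header into a dict (key names are illustrative)."""
    signature, version, oid_version, num_chunks, num_base, num_packs = (
        MIDX_HEADER.unpack_from(data, 0)
    )
    if signature != b"MIDX":
        raise ValueError("not a multi-pack-index file")
    if version != 1 or oid_version != 1:
        raise ValueError("unsupported MIDX or object ID version")
    return {
        "version": version,
        "oid_version": oid_version,
        "num_chunks": num_chunks,
        "num_base_midx": num_base,
        "num_packs": num_packs,
    }
```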

CHUNK LOOKUP:

    (C + 1) * 12 bytes providing the chunk offsets, where C is the
    number of chunks given in the header:
        First 4 bytes describe chunk id. Value 0 is a terminating label.
        Other 8 bytes provide offset in current file for chunk to start.
        (Chunks are provided in file-order, so you can infer the length
        using the next chunk position if necessary.)

    The remaining data in the body is described one chunk at a time, and
    these chunks may be given in any order. Chunks are required unless
    otherwise specified.
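
A sketch of reading the chunk lookup table in Python follows. It
assumes that the table starts directly after the 12-byte header, that
the 8-byte offsets are also stored in network order, and that the
offset of the terminating entry marks the end of the last chunk; none
of these three points is spelled out above, so treat them as
assumptions. The function name is hypothetical.

```python
import struct

CHUNK_ENTRY = struct.Struct(">4sQ")  # 4-byte chunk id, 8-byte file offset


def parse_chunk_lookup(data, num_chunks, table_start=12):
    """Return {chunk_id: (start, end)} from the (C + 1) * 12-byte table.

    table_start=12 assumes the table directly follows the header.
    """
    entries = [
        CHUNK_ENTRY.unpack_from(data, table_start + i * CHUNK_ENTRY.size)
        for i in range(num_chunks + 1)
    ]
    if entries[-1][0] != b"\0\0\0\0":
        raise ValueError("missing terminating chunk entry")
    chunks = {}
    # Chunks are in file order, so each one ends where the next entry starts.
    for (chunk_id, start), (_, next_start) in zip(entries, entries[1:]):
        chunks[chunk_id] = (start, next_start)
    return chunks
```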

CHUNK DATA:

    Packfile Names (ID: {'P', 'N', 'A', 'M'})
        Stores the packfile names as concatenated, null-terminated strings.
        Packfiles must be listed in lexicographic order for fast lookups by
        name. This is the only chunk not guaranteed to be a multiple of four
        bytes in length, so should be the last chunk for alignment reasons.
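
To illustrate the Packfile Names chunk, the following Python sketch
splits the chunk on null bytes and checks the ordering requirement.
It assumes the number of names equals the pack count from the header
and that any bytes after the last name are null padding; the function
name is hypothetical.

```python
def parse_pack_names(chunk, num_packs):
    """Split the chunk into num_packs names; the list index is the pack-int-id."""
    # Assumption: anything after the last name is null padding and is ignored.
    names = chunk.split(b"\0")[:num_packs]
    if len(names) < num_packs or any(not name for name in names):
        raise ValueError("fewer pack names than the header announced")
    if names != sorted(names):
        raise ValueError("pack names are not in lexicographic order")
    return [name.decode() for name in names]
```

Because the names are sorted, a reader can binary-search them to find
the position of a given packfile.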

    OID Fanout (ID: {'O', 'I', 'D', 'F'})
        The ith entry, F[i], stores the number of OIDs with first
        byte at most i. Thus F[255] stores the total
        number of objects.

    OID Lookup (ID: {'O', 'I', 'D', 'L'})
        The OIDs for all objects in the MIDX are stored in lexicographic
        order in this chunk.
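
The two chunks above work together: the fanout narrows the search to
the OIDs sharing a first byte, and a binary search finishes the job.
The Python sketch below assumes the fanout holds 256 entries of 4
bytes each and that OIDs are 20-byte SHA-1 values; the entry width is
not stated above, so it is an assumption, and the function name is
hypothetical.

```python
import struct

HASH_LEN = 20  # SHA-1, the only object ID version described here


def find_oid(fanout_chunk, lookup_chunk, oid):
    """Return the position of the 20-byte oid in the OID Lookup chunk, or None."""
    # Assumption: 256 fanout entries, each a 4-byte network-order integer.
    fanout = struct.unpack(">256I", fanout_chunk[:1024])
    first = oid[0]
    lo = fanout[first - 1] if first > 0 else 0  # OIDs with a smaller first byte
    hi = fanout[first]                          # OIDs with first byte <= first
    # Binary search restricted to positions [lo, hi).
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = lookup_chunk[mid * HASH_LEN:(mid + 1) * HASH_LEN]
        if candidate < oid:
            lo = mid + 1
        elif candidate > oid:
            hi = mid
        else:
            return mid
    return None
```

The returned position is the row to read in the Object Offsets chunk.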

    Object Offsets (ID: {'O', 'O', 'F', 'F'})
        Stores two 4-byte values for every object.
        1: The pack-int-id for the pack storing this object.
        2: The offset within the pack.
            If all offsets are less than 2^32, then the large offset chunk
            will not exist and offsets are stored as in IDX v1.
            If there is at least one offset value larger than 2^32-1, then
            the large offset chunk must exist, and offsets larger than
            2^31-1 must be stored in it instead. If the large offset chunk
            exists and the 31st bit is on, then removing that bit reveals
            the row in the large offsets containing the 8-byte offset of
            this object.

    [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
        8-byte offsets into large packfiles.
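
The offset decoding rule can be sketched in Python as follows. It
reads "the 31st bit" as the most significant bit of the 4-byte offset
value and assumes the 8-byte large offsets are stored in network
order; both readings are assumptions, and the function name is
hypothetical.

```python
import struct

LARGE_OFFSET_FLAG = 0x80000000  # most significant bit of the 4-byte offset


def object_location(offsets_chunk, position, large_offsets_chunk=None):
    """Return (pack_int_id, offset) for the object at `position` in OID order."""
    pack_int_id, raw = struct.unpack_from(">II", offsets_chunk, position * 8)
    if large_offsets_chunk is not None and raw & LARGE_OFFSET_FLAG:
        # With the flag removed, the value is a row in the large offsets chunk.
        row = raw & 0x7FFFFFFF
        (offset,) = struct.unpack_from(">Q", large_offsets_chunk, row * 8)
        return pack_int_id, offset
    # No large offset chunk, or flag not set: the value is the offset itself.
    return pack_int_id, raw
```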

TRAILER:

    20-byte SHA1-checksum of the above contents.
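
A reader can check the trailer by hashing everything before the last
20 bytes. The following Python sketch uses the standard `hashlib`
module; the function name is hypothetical.

```python
import hashlib


def verify_midx_checksum(data):
    """Return True if the last 20 bytes are the SHA-1 of everything before them."""
    if len(data) < 20:
        return False
    body, trailer = data[:-20], data[-20:]
    return hashlib.sha1(body).digest() == trailer
```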