 a43a2e6c2a
			
		
	
	a43a2e6c2a
	
	
	
		
			
			The chunk-based file format is now an API in the code, but we should also take time to document it as a file format. Specifically, it matches the CHUNK LOOKUP sections of the commit-graph and multi-pack-index files, but there are some commonalities that should be grouped in this document. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
		
			
				
	
	
		
			117 lines
		
	
	
		
			5.2 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			117 lines
		
	
	
		
			5.2 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Chunk-based file formats
 | |
| ========================
 | |
| 
 | |
| Some file formats in Git use a common concept of "chunks" to describe
 | |
| sections of the file. This allows structured access to a large file by
 | |
| scanning a small "table of contents" for the remaining data. This common
 | |
| format is used by the `commit-graph` and `multi-pack-index` files. See
 | |
| link:technical/pack-format.html[the `multi-pack-index` format] and
 | |
| link:technical/commit-graph-format.html[the `commit-graph` format] for
 | |
| how they use the chunks to describe structured data.
 | |
| 
 | |
| A chunk-based file format begins with some header information custom to
 | |
| that format. That header should include enough information to identify
 | |
| the file type, format version, and number of chunks in the file. From this
 | |
| information, that file can determine the start of the chunk-based region.
 | |
| 
 | |
| The chunk-based region starts with a table of contents describing where
 | |
| each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
 | |
| where C is the number of chunks. Consider the following table:
 | |
| 
 | |
|   | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
 | |
|   |--------------------|------------------------|
 | |
|   | ID[0]              | OFFSET[0]              |
 | |
|   | ...                | ...                    |
 | |
|   | ID[C]              | OFFSET[C]              |
 | |
|   | 0x0000             | OFFSET[C+1]            |
 | |
| 
 | |
| Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
 | |
| Each integer is stored in network-byte order.
 | |
| 
 | |
| The chunk identifier `ID[i]` is a label for the data stored within this
 | |
| fill from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
 | |
| size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
 | |
| and `OFFSET[i]`. This requires that the chunk data appears contiguously
 | |
| in the same order as the table of contents.
 | |
| 
 | |
| The final entry in the table of contents must be four zero bytes. This
 | |
| confirms that the table of contents is ending and provides the offset for
 | |
| the end of the chunk-based data.
 | |
| 
 | |
| Note: The chunk-based format expects that the file contains _at least_ a
 | |
| trailing hash after `OFFSET[C+1]`.
 | |
| 
 | |
| Functions for working with chunk-based file formats are declared in
 | |
| `chunk-format.h`. Using these methods provide extra checks that assist
 | |
| developers when creating new file formats.
 | |
| 
 | |
| Writing chunk-based file formats
 | |
| --------------------------------
 | |
| 
 | |
| To write a chunk-based file format, create a `struct chunkfile` by
 | |
| calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
 | |
| caller is responsible for opening the `hashfile` and writing header
 | |
| information so the file format is identifiable before the chunk-based
 | |
| format begins.
 | |
| 
 | |
| Then, call `add_chunk()` for each chunk that is intended for write. This
 | |
| populates the `chunkfile` with information about the order and size of
 | |
| each chunk to write. Provide a `chunk_write_fn` function pointer to
 | |
| perform the write of the chunk data upon request.
 | |
| 
 | |
| Call `write_chunkfile()` to write the table of contents to the `hashfile`
 | |
| followed by each of the chunks. This will verify that each chunk wrote
 | |
| the expected amount of data so the table of contents is correct.
 | |
| 
 | |
| Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
 | |
| caller is responsible for finalizing the `hashfile` by writing the trailing
 | |
| hash and closing the file.
 | |
| 
 | |
| Reading chunk-based file formats
 | |
| --------------------------------
 | |
| 
 | |
| To read a chunk-based file format, the file must be opened as a
 | |
| memory-mapped region. The chunk-format API expects that the entire file
 | |
| is mapped as a contiguous memory region.
 | |
| 
 | |
| Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
 | |
| 
 | |
| After reading the header information from the beginning of the file,
 | |
| including the chunk count, call `read_table_of_contents()` to populate
 | |
| the `struct chunkfile` with the list of chunks, their offsets, and their
 | |
| sizes.
 | |
| 
 | |
| Extract the data information for each chunk using `pair_chunk()` or
 | |
| `read_chunk()`:
 | |
| 
 | |
| * `pair_chunk()` assigns a given pointer with the location inside the
 | |
|   memory-mapped file corresponding to that chunk's offset. If the chunk
 | |
|   does not exist, then the pointer is not modified.
 | |
| 
 | |
| * `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
 | |
|   with the appropriate initial pointer and size information. The function
 | |
|   is not called if the chunk does not exist. Use this method to read chunks
 | |
|   if you need to perform immediate parsing or if you need to execute logic
 | |
|   based on the size of the chunk.
 | |
| 
 | |
| After calling these methods, call `free_chunkfile()` to clear the
 | |
| `struct chunkfile` data. This will not close the memory-mapped region.
 | |
| Callers are expected to own that data for the timeframe the pointers into
 | |
| the region are needed.
 | |
| 
 | |
| Examples
 | |
| --------
 | |
| 
 | |
| These file formats use the chunk-format API, and can be used as examples
 | |
| for future formats:
 | |
| 
 | |
| * *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
 | |
|   in `commit-graph.c` for how the chunk-format API is used to write and
 | |
|   parse the commit-graph file format documented in
 | |
|   link:technical/commit-graph-format.html[the commit-graph file format].
 | |
| 
 | |
| * *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
 | |
|   in `midx.c` for how the chunk-format API is used to write and
 | |
|   parse the multi-pack-index file format documented in
 | |
|   link:technical/pack-format.html[the multi-pack-index file format].
 |