 ae30d7b115
			
		
	
	ae30d7b115
	
	
	
		
			
			Add Documentation/technical/commit-graph.txt with details of the planned commit graph feature, including future plans. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
		
			
				
	
	
		
			164 lines
		
	
	
		
			7.3 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			164 lines
		
	
	
		
			7.3 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Git Commit Graph Design Notes
 | |
| =============================
 | |
| 
 | |
| Git walks the commit graph for many reasons, including:
 | |
| 
 | |
| 1. Listing and filtering commit history.
 | |
| 2. Computing merge bases.
 | |
| 
 | |
| These operations can become slow as the commit count grows. The merge
 | |
| base calculation shows up in many user-facing commands, such as 'merge-base'
 | |
| or 'status' and can take minutes to compute depending on history shape.
 | |
| 
 | |
| There are two main costs here:
 | |
| 
 | |
| 1. Decompressing and parsing commits.
 | |
| 2. Walking the entire graph to satisfy topological order constraints.
 | |
| 
 | |
| The commit graph file is a supplemental data structure that accelerates
 | |
| commit graph walks. If a user downgrades or disables the 'core.commitGraph'
 | |
| config setting, then the existing ODB is sufficient. The file is stored
 | |
| as "commit-graph" either in the .git/objects/info directory or in the info
 | |
| directory of an alternate.
 | |
| 
 | |
| The commit graph file stores the commit graph structure along with some
 | |
| extra metadata to speed up graph walks. By listing commit OIDs in lexi-
 | |
| cographic order, we can identify an integer position for each commit and
 | |
| refer to the parents of a commit using those integer positions. We use
 | |
| binary search to find initial commits and then use the integer positions
 | |
| for fast lookups during the walk.
 | |
| 
 | |
| A consumer may load the following info for a commit from the graph:
 | |
| 
 | |
| 1. The commit OID.
 | |
| 2. The list of parents, along with their integer position.
 | |
| 3. The commit date.
 | |
| 4. The root tree OID.
 | |
| 5. The generation number (see definition below).
 | |
| 
 | |
| Values 1-4 satisfy the requirements of parse_commit_gently().
 | |
| 
 | |
| Define the "generation number" of a commit recursively as follows:
 | |
| 
 | |
|  * A commit with no parents (a root commit) has generation number one.
 | |
| 
 | |
|  * A commit with at least one parent has generation number one more than
 | |
|    the largest generation number among its parents.
 | |
| 
 | |
| Equivalently, the generation number of a commit A is one more than the
 | |
| length of a longest path from A to a root commit. The recursive definition
 | |
| is easier to use for computation and observing the following property:
 | |
| 
 | |
|     If A and B are commits with generation numbers N and M, respectively,
 | |
|     and N <= M, then A cannot reach B. That is, we know without searching
 | |
|     that B is not an ancestor of A because it is further from a root commit
 | |
|     than A.
 | |
| 
 | |
|     Conversely, when checking if A is an ancestor of B, then we only need
 | |
|     to walk commits until all commits on the walk boundary have generation
 | |
|     number at most N. If we walk commits using a priority queue seeded by
 | |
|     generation numbers, then we always expand the boundary commit with highest
 | |
|     generation number and can easily detect the stopping condition.
 | |
| 
 | |
| This property can be used to significantly reduce the time it takes to
 | |
| walk commits and determine topological relationships. Without generation
 | |
| numbers, the general heuristic is the following:
 | |
| 
 | |
|     If A and B are commits with commit time X and Y, respectively, and
 | |
|     X < Y, then A _probably_ cannot reach B.
 | |
| 
 | |
| This heuristic is currently used whenever the computation is allowed to
 | |
| violate topological relationships due to clock skew (such as "git log"
 | |
| with default order), but is not used when the topological order is
 | |
| required (such as merge base calculations, "git log --graph").
 | |
| 
 | |
| In practice, we expect some commits to be created recently and not stored
 | |
| in the commit graph. We can treat these commits as having "infinite"
 | |
| generation number and walk until reaching commits with known generation
 | |
| number.
 | |
| 
 | |
| Design Details
 | |
| --------------
 | |
| 
 | |
| - The commit graph file is stored in a file named 'commit-graph' in the
 | |
|   .git/objects/info directory. This could be stored in the info directory
 | |
|   of an alternate.
 | |
| 
 | |
| - The core.commitGraph config setting must be on to consume graph files.
 | |
| 
 | |
| - The file format includes parameters for the object ID hash function,
 | |
|   so a future change of hash algorithm does not require a change in format.
 | |
| 
 | |
| Future Work
 | |
| -----------
 | |
| 
 | |
| - The commit graph feature currently does not honor commit grafts. This can
 | |
|   be remedied by duplicating or refactoring the current graft logic.
 | |
| 
 | |
| - The 'commit-graph' subcommand does not have a "verify" mode that is
 | |
|   necessary for integration with fsck.
 | |
| 
 | |
| - The file format includes room for precomputed generation numbers. These
 | |
|   are not currently computed, so all generation numbers will be marked as
 | |
|   0 (or "uncomputed"). A later patch will include this calculation.
 | |
| 
 | |
| - After computing and storing generation numbers, we must make graph
 | |
|   walks aware of generation numbers to gain the performance benefits they
 | |
|   enable. This will mostly be accomplished by swapping a commit-date-ordered
 | |
|   priority queue with one ordered by generation number. The following
 | |
|   operations are important candidates:
 | |
| 
 | |
|     - paint_down_to_common()
 | |
|     - 'log --topo-order'
 | |
| 
 | |
| - Currently, parse_commit_gently() requires filling in the root tree
 | |
|   object for a commit. This passes through lookup_tree() and consequently
 | |
|   lookup_object(). Also, it calls lookup_commit() when loading the parents.
 | |
|   These method calls check the ODB for object existence, even if the
 | |
|   consumer does not need the content. For example, we do not need the
 | |
|   tree contents when computing merge bases. Now that commit parsing is
 | |
|   removed from the computation time, these lookup operations are the
 | |
|   slowest operations keeping graph walks from being fast. Consider
 | |
|   loading these objects without verifying their existence in the ODB and
 | |
|   only loading them fully when consumers need them. Consider a method
 | |
|   such as "ensure_tree_loaded(commit)" that fully loads a tree before
 | |
|   using commit->tree.
 | |
| 
 | |
| - The current design uses the 'commit-graph' subcommand to generate the graph.
 | |
|   When this feature stabilizes enough to recommend to most users, we should
 | |
|   add automatic graph writes to common operations that create many commits.
 | |
|   For example, one could compute a graph on 'clone', 'fetch', or 'repack'
 | |
|   commands.
 | |
| 
 | |
| - A server could provide a commit graph file as part of the network protocol
 | |
|   to avoid extra calculations by clients. This feature is only of benefit if
 | |
|   the user is willing to trust the file, because verifying the file is correct
 | |
|   is as hard as computing it from scratch.
 | |
| 
 | |
| Related Links
 | |
| -------------
 | |
| [0] https://bugs.chromium.org/p/git/issues/detail?id=8
 | |
|     Chromium work item for: Serialized Commit Graph
 | |
| 
 | |
| [1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/
 | |
|     An abandoned patch that introduced generation numbers.
 | |
| 
 | |
| [2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/
 | |
|     Discussion about generation numbers on commits and how they interact
 | |
|     with fsck.
 | |
| 
 | |
| [3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/
 | |
|     More discussion about generation numbers and not storing them inside
 | |
|     commit objects. A valuable quote:
 | |
| 
 | |
|     "I think we should be moving more in the direction of keeping
 | |
|      repo-local caches for optimizations. Reachability bitmaps have been
 | |
|      a big performance win. I think we should be doing the same with our
 | |
|      properties of commits. Not just generation numbers, but making it
 | |
|      cheap to access the graph structure without zlib-inflating whole
 | |
|      commit objects (i.e., packv4 or something like the "metapacks" I
 | |
|      proposed a few years ago)."
 | |
| 
 | |
| [4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u
 | |
|     A patch to remove the ahead-behind calculation from 'status'.
 |