Parallel Checkout Design Notes
==============================

The "Parallel Checkout" feature attempts to use multiple processes to
parallelize the work of uncompressing the blobs, applying in-core
filters, and writing the resulting contents to the working tree during a
checkout operation. It can be used by all checkout-related commands,
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.

These commands share the following basic structure:

* Step 1: Read the current index file into memory.

* Step 2: Modify the in-memory index based upon the command, and
  temporarily mark all cache entries that need to be updated.

* Step 3: Populate the working tree to match the new candidate index.
  This includes iterating over all of the to-be-updated cache entries
  and deleting, creating, or overwriting the associated files in the
  working tree.

* Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.
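
For orientation, the skeleton shared by these commands could be
sketched as follows. The function names here are purely illustrative;
the real commands go through `unpack_trees()` and related machinery:

----------------------------------------------
read_index();             /* Step 1: load the index into memory */
update_index_and_mark();  /* Step 2: illustrative; mark entries to update */
populate_working_tree();  /* Step 3: illustrative; detailed below */
write_index();            /* Step 4: write the result back to disk */
----------------------------------------------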

Sequential Implementation
-------------------------

For the purposes of discussion here, the current sequential
implementation of Step 3 is divided into 3 parts, each one implemented
in its own function:

* Step 3a: `unpack-trees.c:check_updates()` contains a series of
  sequential loops iterating over the `cache_entry`'s array. The main
  loop in this function calls the Step 3b function for each of the
  to-be-updated entries.

* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
  for file conflicts, collisions, and unsaved changes. It removes files
  and creates leading directories as necessary. It calls the Step 3c
  function for each entry to be written.

* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
  it if necessary, creates the file in the working tree, writes the
  smudged contents, calls `fstat()` or `lstat()`, and updates the
  associated `cache_entry` struct with the stat information gathered.
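
Schematically, and with heavily simplified signatures (the real
functions take several more parameters), this call chain looks like:

----------------------------------------------
static void write_entry(struct cache_entry *ce);     /* Step 3c */

static void checkout_entry(struct cache_entry *ce)   /* Step 3b */
{
	/* check for conflicts and unsaved changes, remove blocking
	 * files, create leading directories... */
	write_entry(ce);
}

static void check_updates(struct index_state *index)  /* Step 3a */
{
	for (unsigned int i = 0; i < index->cache_nr; i++) {
		struct cache_entry *ce = index->cache[i];
		if (ce->ce_flags & CE_UPDATE)
			checkout_entry(ce);
	}
}
----------------------------------------------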

It wouldn't be safe to perform Step 3b in parallel, as there could be
race conditions between file creations and removals. Instead, the
parallel checkout framework lets the sequential code handle Step 3b,
and uses parallel workers to replace the sequential
`entry.c:write_entry()` calls from Step 3c.

Rejected Multi-Threaded Solution
--------------------------------

The most "straightforward" implementation would be to spread the set of
to-be-updated cache entries across multiple threads. But due to the
thread-unsafe functions in the ODB code, we would have to use locks to
coordinate the parallel operation. An early prototype of this solution
showed that the multi-threaded checkout would bring performance
improvements over the sequential code, but there was still too much lock
contention: profiling with `perf` indicated that around 20% of the
runtime during a local Linux clone (on an SSD) was spent in locking
functions. For this reason, this approach was rejected in favor of using
multiple child processes, which led to better performance.

Multi-Process Solution
----------------------

Parallel checkout alters the aforementioned Step 3 to use multiple
`checkout--worker` background processes to distribute the work. The
long-running worker processes are controlled by the foreground Git
command using the existing run-command API.

Overview
~~~~~~~~

Step 3b is only slightly altered; for each entry to be checked out, the
main process performs the following steps:

* M1: Check whether there is any untracked or unclean file in the
  working tree which would be overwritten by this entry, and decide
  whether to proceed (removing the file(s)) or not.

* M2: Create the leading directories.

* M3: Load the conversion attributes for the entry's path.

* M4: Check, based on the entry's type and conversion attributes,
  whether the entry is eligible for parallel checkout (more on this
  later). If it is eligible, enqueue the entry and the loaded
  attributes to later write the entry in parallel. If not, write the
  entry right away, using the default sequential code.
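
In pseudo-C, with hypothetical helper names and simplified signatures
(the actual logic is spread across `entry.c` and the parallel checkout
code), the altered Step 3b looks roughly like this:

----------------------------------------------
int checkout_entry(struct cache_entry *ce, struct checkout *state)
{
	struct conv_attrs ca;

	if (remove_blocking_files(ce, state) < 0)  /* M1 (hypothetical) */
		return -1;
	create_leading_directories(ce);            /* M2 (hypothetical) */
	load_conv_attrs(&ca, ce->name);            /* M3 (hypothetical) */

	/* M4: enqueue for parallel writing when possible */
	if (parallel_checkout_enabled() && is_eligible(ce, &ca))
		return enqueue_entry(ce, &ca);
	return write_entry(ce, state, &ca);        /* sequential fallback */
}
----------------------------------------------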

Note: we save the conversion attributes associated with each entry
because the workers don't have access to the main process' index state,
so they can't load the attributes by themselves (and the attributes are
needed to properly smudge the entry). Additionally, this has a positive
impact on performance as (1) we don't need to load the attributes twice
and (2) the attributes machinery is optimized to handle paths in
sequential order.

After all entries have passed through the above steps, the main process
checks if the number of enqueued entries is sufficient to spread among
the workers. If not, it just writes them sequentially. Otherwise, it
spawns the workers and distributes the queued entries uniformly in
continuous chunks. This aims to minimize the chances of two workers
writing to the same directory simultaneously, which could increase lock
contention in the kernel.
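
As an illustration of that distribution (not the actual code), splitting
`nr` queued items among `workers` processes in contiguous, near-equal
chunks could look like:

----------------------------------------------
/* Hypothetical helper: assign items [0, nr) in contiguous ranges. */
static void distribute_items(size_t nr, size_t workers)
{
	size_t base = nr / workers;
	size_t rem = nr % workers;
	size_t start = 0;

	for (size_t w = 0; w < workers; w++) {
		size_t count = base + (w < rem ? 1 : 0);
		/* illustrative IPC: worker w gets [start, start + count) */
		send_batch(w, start, count);
		start += count;
	}
}
----------------------------------------------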

Then, for each assigned item, each worker:

* W1: Checks if there is any non-directory file in the leading part of
  the entry's path or if there already exists a file at the entry's
  path. If so, marks the entry with `PC_ITEM_COLLIDED` and skips it
  (more on this later).

* W2: Creates the file (with `O_CREAT` and `O_EXCL`).

* W3: Loads the blob into memory (inflating and delta reconstructing
  it).

* W4: Applies any required in-process filter, like end-of-line
  conversion and re-encoding.

* W5: Writes the result to the file descriptor opened at W2.

* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
  the result back to the main process, together with the end status of
  the operation and the item's identification number.
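
A condensed, illustrative version of this per-item sequence follows;
the helper names, struct fields, and the `PC_ITEM_WRITTEN` and
`PC_ITEM_FAILED` statuses are simplifications, not the exact code:

----------------------------------------------
static void write_one_item(struct parallel_checkout_item *item)
{
	struct stat st;
	int fd;

	item->status = PC_ITEM_FAILED;

	if (path_collides(item)) {                       /* W1 */
		item->status = PC_ITEM_COLLIDED;
		goto report;
	}

	fd = open(item->path, O_WRONLY | O_CREAT | O_EXCL, item->mode); /* W2 */
	if (fd < 0)
		goto report;

	/* W3-W5: load the blob, apply internal filters, write to fd
	 * (delegated to the streaming machinery when possible). */
	if (write_filtered_blob(item, fd) < 0) {
		close(fd);
		unlink(item->path); /* the only removal a worker may do */
		goto report;
	}
	if (!fstat(fd, &st)) {                           /* W6 */
		item->st = st;
		item->status = PC_ITEM_WRITTEN;
	}
	close(fd);
report:
	report_result(item); /* status + stat data + item id, via pipe */
}
----------------------------------------------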

Note that, when possible, steps W3 to W5 are delegated to the streaming
machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree,
it removes the created file to avoid leaving empty files behind. This is
the *only* time a worker is allowed to remove a file.

As mentioned earlier, it is the responsibility of the main process to
remove any file that blocks the checkout operation (or abort if the
removal(s) would cause data loss and the user didn't ask to `--force`).
This is crucial to avoid race conditions and also to properly detect
path collisions at Step W1.

After the workers finish writing the items and sending back the required
information, the main process handles the results in two steps:

- First, it updates the in-memory index with the `lstat()` information
  sent by the workers. (This must be done first as this information
  might be required in the following step.)

- Then it writes the items which collided on disk (i.e. items marked
  with `PC_ITEM_COLLIDED`). More on this below.
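
Sketched with hypothetical names (the real code keeps more
bookkeeping), this post-processing could be:

----------------------------------------------
/* 1: store the stat data reported by the workers in the index */
for (size_t i = 0; i < nr_items; i++)
	if (items[i].status == PC_ITEM_WRITTEN)
		update_ce_stat_info(items[i].ce, &items[i].st);

/* 2: retry the collided entries, now sequentially and race-free */
for (size_t i = 0; i < nr_items; i++)
	if (items[i].status == PC_ITEM_COLLIDED)
		checkout_entry(items[i].ce, state, NULL, NULL);
----------------------------------------------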

Path Collisions
---------------

Path collisions happen when two different paths correspond to the same
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
case-insensitive file system.

The sequential checkout deals with collisions in the same way that it
deals with files that were already present in the working tree before
checkout. Basically, it checks if the path that it wants to write
already exists on disk, makes sure the existing file doesn't have
unsaved data, and then overwrites it. (To be more pedantic: it deletes
the existing file and creates the new one.) So, if there are multiple
colliding files to be checked out, the sequential code will write each
one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we
cannot let the workers racily write to the same file on disk. Instead,
the workers detect when the entry that they want to check out would
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
Later, the main process can sequentially feed these entries back to
`checkout_entry()` without the risk of race conditions. On clone, this
also has the effect of marking the colliding entries to later emit a
warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being
concurrently written and collisions between a parallel-eligible entry
and an ineligible entry. The general idea for collision detection is
quite straightforward: for each parallel-eligible entry, the main
process must remove all files that prevent this entry from being written
(before enqueueing it). This includes any non-directory file in the
leading path of the entry. Later, when a worker gets assigned the entry,
it looks again for non-directory files in the leading path and for an
already existing file at the entry's path. If any of these checks finds
something, the worker knows that there was a path collision.
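
A minimal sketch of that probe, assuming POSIX semantics (in the actual
code the check is interleaved with the file creation itself):

----------------------------------------------
#include <errno.h>
#include <sys/stat.h>

static int path_collides(const char *path)
{
	struct stat st;

	if (!lstat(path, &st))
		return 1; /* something already exists at the entry's path */
	if (errno == ENOTDIR)
		return 1; /* a leading path component is not a directory */
	return 0; /* the path is free; the main process cleared it */
}
----------------------------------------------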

Because parallel checkout can distinguish path collisions from the case
where the file was already present in the working tree before checkout,
we could alternatively choose to skip the checkout of colliding entries.
However, each entry that doesn't get written would have null `lstat()`
fields in the index. This could cause performance penalties for
subsequent commands that need to refresh the index, as they would have
to go to the file system to see if the entry is dirty. Thus, if we have
N entries in a colliding group and we decide to write and `lstat()` only
one of them, every subsequent `git status` will have to read, convert,
and hash the written file N - 1 times. By checking out all colliding
entries (like the sequential code does), we only pay the overhead once,
during checkout.

Eligible Entries for Parallel Checkout
--------------------------------------

As previously mentioned, not all entries passed to `checkout_entry()`
will be considered eligible for parallel checkout. More specifically, we
exclude:

- Symbolic links; to avoid race conditions that, in combination with
  path collisions, could cause workers to write files at the wrong
  place. For example, if we were to concurrently check out a symlink
  'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
  we could potentially end up writing the file 'A/f' at 'a/f', due to a
  race condition.

- Regular files that require external filters (either "one shot" filters
  or long-running process filters). These filters are black-boxes to Git
  and may have their own internal locking or non-concurrent assumptions.
  So it might not be safe to run multiple instances in parallel.
+
Besides, long-running filters may use the delayed checkout feature to
postpone the return of some filtered blobs. The delayed checkout queue
and the parallel checkout queue are not compatible and should remain
separate.
+
Note: regular files that only require internal filters, like end-of-line
conversion and re-encoding, are eligible for parallel checkout.

Ineligible entries are checked out by the classic sequential codepath
*before* spawning workers.

Note: submodules' files are also eligible for parallel checkout (as
long as they don't fall into any of the excluding categories mentioned
above). But since each submodule is checked out in its own child
process, we don't mix the superproject's and the submodules' files in
the same parallel checkout process or queue.
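
Condensed into a hypothetical predicate (Git's actual classification
lives in the conversion code and is more nuanced; this assumes
`conv_attrs` exposes its filter driver), the rules above amount to:

----------------------------------------------
static int is_eligible_for_parallel_checkout(const struct cache_entry *ce,
					     const struct conv_attrs *ca)
{
	if (!S_ISREG(ce->ce_mode))
		return 0; /* exclude symlinks and other non-regular entries */
	if (ca->drv && (ca->drv->smudge || ca->drv->process))
		return 0; /* exclude external and long-running filters */
	return 1; /* internal-only filters (EOL, re-encoding) are fine */
}
----------------------------------------------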

The API
-------

The parallel checkout API was designed with the goal of minimizing
changes to the current users of the checkout machinery. This means that
they don't have to call a different function for sequential or parallel
checkout. As already mentioned, `checkout_entry()` will automatically
insert the given entry in the parallel checkout queue when this feature
is enabled and the entry is eligible; otherwise, it will just write the
entry right away, using the sequential code. In general, callers of the
parallel checkout API should look similar to this:

----------------------------------------------
int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
	init_parallel_checkout();

for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
----------------------------------------------