
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values. The paths
used for this test are carefully selected to demonstrate some of the
behavior differences of the two current name hash versions, including
which conditions will cause them to collide.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps someone
understand how the name-hash algorithms behave for their repo, based on
the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                     this tree
--------------------------------------------------
5314.1: paths at head                         4.5K
5314.2: distinct hash value: v1               4.1K
5314.3: maximum multiplicity: v1                13
5314.4: distinct hash value: v2               4.2K
5314.5: maximum multiplicity: v2                 9

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                     this tree
--------------------------------------------------
5314.1: paths at head                        19.5K
5314.2: distinct hash value: v1               8.2K
5314.3: maximum multiplicity: v1               279
5314.4: distinct hash value: v2              17.8K
5314.5: maximum multiplicity: v2                44

[1] https://github.com/microsoft/fluentui

That demonstrates that the nearly twenty thousand path names are
assigned only around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

With the v2 name hash function, the maximum multiplicity lowers to 44,
leaving some room for further improvement.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                     this tree
--------------------------------------------------
5314.1: paths at head                       227.3K
5314.2: distinct hash value: v1              72.3K
5314.3: maximum multiplicity: v1             14.4K
5314.4: distinct hash value: v2             166.5K
5314.5: maximum multiplicity: v2               138

Here, we can see that the v2 name hash function provides some
improvements, but there are still a number of collisions that could lead
to repacking problems at this scale.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
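A figure such as "around 10% of paths have a collision" can be checked directly from a file of hash values. The following is a minimal sketch, not part of the commit: `colliding_share` is a hypothetical helper, and the sample rows are invented. It assumes each row holds the v1 hash, the v2 hash, and the path, in that order.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): print
# "<paths sharing a hash with another path> <total paths>".
colliding_share () {
	# $1: file with one whitespace-separated row per path
	# $2: column that holds the hash value to inspect
	awk -v col="$2" '
		{ count[$col]++ }
		END {
			for (h in count)
				if (count[h] > 1)
					colliding += count[h]
			print colliding + 0, NR
		}
	' <"$1"
}

# Invented sample rows: v1 hash, v2 hash, path.
sample=$(mktemp)
cat >"$sample" <<'EOF'
100 900 a/file.txt
100 901 b/file.txt
100 902 c/file.txt
200 903 d/other.c
EOF

colliding_share "$sample" 1
colliding_share "$sample" 2
```

On the invented sample this prints `3 4` for the first column and `0 4` for the second: three of the four paths share a value in column 1, and none do in column 2.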
#!/bin/sh

test_description='Tests pack performance using bitmaps'
. ./perf-lib.sh

GIT_TEST_PASSING_SANITIZE_LEAK=0
export GIT_TEST_PASSING_SANITIZE_LEAK

test_perf_large_repo

# Report the number of paths at HEAD, then hash every path. The tests
# below read the v1 and v2 hash values from fields 1 and 2 of name-hashes.
test_size 'paths at head' '
	git ls-tree -r --name-only HEAD >path-list &&
	wc -l <path-list &&
	test-tool name-hash <path-list >name-hashes
'

for version in 1 2
do
	# \$$version expands to the awk field ($1 or $2) for this version.
	test_size "distinct hash value: v$version" '
		awk "{ print \$$version; }" <name-hashes | sort | \
			uniq -c >name-hash-count &&
		wc -l <name-hash-count
	'

	# Reuses name-hash-count written by the previous test.
	test_size "maximum multiplicity: v$version" '
		sort -nr <name-hash-count | head -n 1 | \
			awk "{ print \$1; }"
	'
done

test_done
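To see what the two size tests report without the perf harness or test-tool, the same sort, `uniq -c`, and `sort -nr` steps can be run on a hand-made file. This is a sketch under assumptions: `collision_stats` is a hypothetical helper, and the rows are invented stand-ins for the contents of name-hashes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git's test suite): print
# "<distinct hash values> <maximum multiplicity>" for one column.
collision_stats () {
	# $1: file with one whitespace-separated row per path
	# $2: column that holds the hash value to summarize
	awk -v col="$2" '{ print $col }' <"$1" | sort | uniq -c |
	sort -nr |
	awk 'NR == 1 { max = $1 } END { print NR, max + 0 }'
}

# Invented rows standing in for name-hashes: v1 hash, v2 hash, path.
sample=$(mktemp)
cat >"$sample" <<'EOF'
100 900 a/file.txt
100 901 b/file.txt
100 902 c/file.txt
200 903 d/other.c
EOF

for version in 1 2
do
	echo "v$version: $(collision_stats "$sample" "$version")"
done
```

For the invented rows this prints `v1: 2 3` and `v2: 4 1`: column 1 has two distinct values, one of them shared by three paths, while column 2 has four distinct values and no collisions.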