Skip to content

[RFC] Multi-pack Index (MIDX)#1

Closed
derrickstolee wants to merge 18 commits intomasterfrom
midx/review
Closed

[RFC] Multi-pack Index (MIDX)#1
derrickstolee wants to merge 18 commits intomasterfrom
midx/review

Conversation

@derrickstolee
Copy link
Owner

This RFC includes a new way to index the objects in multiple packs
using one file, called the multi-pack index (MIDX).

The commits are split into parts as follows:

[01] - A full design document.

[02] - The full file format for MIDX files.

[03] - Creation of core.midx config setting.

[04-12] - Creation of "midx" builtin that writes, reads, and deletes
MIDX files.

[13-18] - Consume MIDX files for abbreviations and object loads.

The main goals of this RFC are:

  • Determine interest in this feature.

  • Find other use cases for the MIDX feature.

  • Design a proper command-line interface for constructing and checking
    MIDX files. The current "midx" builtin is probably inadequate.

  • Determine what additional changes are needed before the feature can
    be merged. Specifically, I'm interested in the interactions with
    repack and fsck. The current patch also does not update the MIDX on
    a fetch (which adds a packfile) but would be valuable. Whenever
    possible, I tried to leave out features that could be added in a
    later patch.

  • Consider splitting this patch into multiple patches, such as:

    i. The MIDX design document.
    ii. The command-line interface for building and reading MIDX files.
    iii. Integrations with abbreviations and object lookups.

Please do not send any style nits to this patch, as I expect the code to
change dramatically before we consider merging.

I created three copies of the Linux repo with 1, 24, and 127 packfiles
each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies
gave significant performance improvements on the following comand:

git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX | Rel % | 1 pack %
----------+-------------+------------+--------+----------
1 | 35.64 s | 35.28 s | -1.0% | -1.0%
24 | 90.81 s | 40.06 s | -55.9% | +12.4%
127 | 257.97 s | 42.25 s | -83.6% | +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache workes very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

git midx --write --update-head --delete-expired --pack-dir=<alt>

Add a technical documentation file describing the design
for the multi-pack index (MIDX). Includes current limitations
and future work.
A multi-pack-index (MIDX) file indexes the objects in multiple
packfiles in a single pack directory. After a simple fixed-size
header, the version 1 file format uses chunks to specify
different regions of the data that correspond to different types
of data, including:

- List of OIDs in lex-order
- A fanout table into the OID list
- List of packfile names (relative to pack directory)
- List of object metadata
- Large offsets (if needed)

By adding extra optional chunks, we can easily extend this format
without invalidating written v1 files.

One value in the header corresponds to a number of "base MIDX files"
and will always be zero until the value is used in a later patch.

We considered using a hashtable format instead of an ordered list
of objects to reduce the O(log N) lookups to O(1) lookups, but two
main issues arose that lead us to abandon the idea:

- Extra space required to ensure collision counts were low.
- We need to identify the two lexicographically closest OIDs for
  fast abbreviations. Binary search allows this.

The current solution presents multiple packfiles as if they were
packed into a single packfile with one pack-index.
As the multi-pack-index feature is being developed, we use a config
setting 'core.midx' to enable all use of MIDX files.

Since MIDX files are designed as a way to augment the existing data
stores in Git, turning this setting off will revert to previous
behavior without needing to downgrade. This can also be a repo-
specific setting if the MIDX is misbehaving in only one repo.
The write_midx_file() method takes a list of packfiles and indexed
objects with offset information and writes according to the format
in Documentation/technical/pack-format.txt. The chunks are separated
into methods.
Create, document, and implement the first ability of the midx builtin.

The --write subcommand creates a multi-pack-index for all indexed
packfiles within a given pack directory. If none is provided, the
objects/pack directory is implied. The arguments allow specifying the
pack directory so we can add MIDX files to alternates.

The packfiles are expected to be paired with pack-indexes and are
otherwise ignored. This simplifies the implementation and also keeps
compatibility with older versions of Git (or changing core.midx to
false).
Test interactions between the midx builtin and other Git operations.

Use both a full repo and a bare repo to ensure the pack directory
redirection works correctly.
There may be multiple MIDX files in a single pack directory. The primary
file is pointed to by a pointer file "midx-head" that contains an OID.
The MIDX file to load is then given by "midx-<OID>.midx".

This head file will be especially important when the MIDX files are
extended to be incremental and we expect multiple MIDX files at any
point.
Add a "--read" subcommand to the midx builtin to report summary information
on the head MIDX file or a MIDX file specified by the supplied "--midx-id"
parameter.

This subcommand is used by t5318-midx.sh to verify the indexed objects are
as expected.
The MIDX file stores pack offset information for a list of objects. The
nth_midxed_object_* methods provide ways to extract this information.
When writing a new MIDX file, it is faster to use an existing MIDX file
to load the object list and pack offsets and to only inspect pack-indexes
for packs not already covered by the MIDX file.
As a way to troubleshoot unforeseen problems with MIDX files, provide
a way to delete "midx-head" and the MIDX it references.
As we write new MIDX files, the existing files are probably not needed. Supply
the "--delete-expired" flag to remove these files during the "--write" sub-
command.
Perform some basic read-only operations that load objects and find
abbreviations. As this functionality begins to reference MIDX files,
ensure the output matches when using MIDX files and when not using them.
Replace prepare_packed_git() with prepare_packed_git_internal(use_midx) to
allow some consumers of prepare_packed_git() with a way to load MIDX files.
Consumers should only use the new method if they are prepared to use the
midxed_git struct alongside the packed_git struct.

If a packfile is found that is not referenced by the current MIDX, then add
it to the packed_git struct. This is important to keep the MIDX useful after
adding packs due to "fetch" commands and when third-party tools (such as
libgit2) add packs directly to the repo.

If prepare_packed_git_internal is called with use_midx = 0, then unload the
MIDX file and reload the packfiles in to the packed_git struct.
The MIDX files contain a complete object count, so we can report the number
of objects in the MIDX. The count remains approximate as there may be overlap
between the packfiles not referenced by the MIDX.
Using a binary search, we can navigate to the position n within a
MIDX file where an object appears in the ordered list of objects.
Create unique_in_midx() to mimic behavior of unique_in_pack().

Create find_abbrev_len_for_midx() to mimic behavior of
find_abbrev_len_for_pack().

Consume these methods when interacting with abbreviations.
When looking for a packed object, first check the MIDX for that object.
This reduces thrashing in the MRU list of packfiles.
derrickstolee pushed a commit that referenced this pull request Apr 2, 2018
The function ce_write_entry() uses a 'self-initialised' variable
construct, for the symbol 'saved_namelen', to suppress a gcc
'-Wmaybe-uninitialized' warning, given that the warning is a false
positive.

For the purposes of this discussion, the ce_write_entry() function has
three code blocks of interest, that look like so:

        /* block #1 */
        if (ce->ce_flags & CE_STRIP_NAME) {
                saved_namelen = ce_namelen(ce);
                ce->ce_namelen = 0;
        }

        /* block #2 */
        /*
	 * several code blocks that contain, among others, calls
         * to copy_cache_entry_to_ondisk(ondisk, ce);
         */

        /* block #3 */
        if (ce->ce_flags & CE_STRIP_NAME) {
                ce->ce_namelen = saved_namelen;
                ce->ce_flags &= ~CE_STRIP_NAME;
        }

The warning implies that gcc thinks it is possible that the first
block is not entered, the calls to copy_cache_entry_to_ondisk()
could toggle the CE_STRIP_NAME flag on, thereby entering block #3
with saved_namelen unset. However, the copy_cache_entry_to_ondisk()
function does not write to ce->ce_flags (it only reads). gcc could
easily determine this, since that function is local to this file,
but it obviously doesn't.

In order to suppress this warning, we make it clear to the reader
(human and compiler), that block #3 will only be entered when the
first block has been entered, by introducing a new 'stripped_name'
boolean variable. We also take the opportunity to change the type
of 'saved_namelen' to 'unsigned int' to match ce->ce_namelen.

Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Jul 9, 2018
Change "fetch" to treat "+" in refspecs (aka --force) to mean we
should clobber a local tag of the same name.

This changes the long-standing behavior of "fetch" added in
853a369 ("[PATCH] Multi-head fetch.", 2005-08-20), before this
change all tag fetches effectively had --force enabled. The original
rationale in that change was:

    > Tags need not be pointing at commits so there is no way to
    > guarantee "fast-forward" anyway.

That comment and the rest of the history of "fetch" shows that the
"+" (--force) part of refpecs was only conceived for branch updates,
while tags have accepted any changes from upstream unconditionally and
clobbered the local tag object. Changing this behavior has been
discussed as early as 2011[1].

I the current behavior doesn't make sense, it easily results in local
tags accidentally being clobbered. Ideally we'd namespace our tags
per-remote, but as with my 97716d2 ("fetch: add a --prune-tags
option and fetch.pruneTags config", 2018-02-09) it's easier to work
around the current implementation than to fix the root cause, so this
implements suggestion #1 from [1], "fetch" now only clobbers the tag
if either "+" is provided as part of the refspec, or if "--force" is
provided on the command-line.

This also makes it nicely symmetrical with how "tag" itself
works. We'll now refuse to clobber any existing tags unless "--force"
is supplied, whether that clobbering would happen by clobbering a
local tag with "tag", or by fetching it from the remote with "fetch".

It's still not at all nicely symmetrical with how "git push" works, as
discussed in the updated pull-fetch-param.txt documentation, but this
change brings them more into line with one another. I don't think
there's any reason "fetch" couldn't fully converge with the behavior
used by "push", but that's a topic for another change.

One of the tests added in 31b808a ("clone --single: limit the fetch
refspec to fetched branch", 2012-09-20) is being changed to use
--force where a clone would clobber a tag. This changes nothing about
the existing behavior of the test.

1. https://public-inbox.org/git/20111123221658.GA22313@sigill.intra.peff.net/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Aug 21, 2018
When "git rebase -i" is told to squash two or more commits into
one, it labeled the log message for each commit with its number.
It correctly called the first one "1st commit", but the next one
was "commit #1", which was off-by-one.  This has been corrected.

* pw/rebase-i-squash-number-fix:
  rebase -i: fix numbering in squash message
derrickstolee pushed a commit that referenced this pull request Aug 30, 2018
The verbose output of the test 'reword without issues functions as
intended' in 't3423-rebase-reword.sh', added in a9279c6 (sequencer:
do not squash 'reword' commits when we hit conflicts, 2018-06-19),
contains the following error output:

  sed: -e expression #1, char 2: extra characters after command

This error comes from within the 'fake-editor.sh' script created by
'lib-rebase.sh's set_fake_editor() function, and the root cause is the
FAKE_LINES="pick 1 reword 2" variable in the test in question, in
particular the "pick" word.  'fake-editor.sh' assumes 'pick' to be the
default rebase command and doesn't support an explicit 'pick' command
in FAKE_LINES.  As a result, 'pick' will be used instead of a line
number when assembling the following 'sed' script:

  sed -n picks/^pick/pick/p

which triggers the aforementioned error.

Luckily, this didn't affect the test's correctness: the erroring 'sed'
command doesn't write anything to the todo script, and processing the
rest of FAKE_LINES generates the desired todo script, as if that
'pick' command were not there at all.

The minimal fix would be to remove the 'pick' word from FAKE_LINES,
but that would leave us susceptible to similar issues in the future.

Instead, teach the fake-editor script to recognize an explicit 'pick'
command, which is still a fairly trivial change.

In the future we might want to consider reinforcing this fake editor
script with an &&-chain and stricter parsing of the FAKE_LINES
variable (e.g. to error out when encountering unknown rebase commands
or commands and line numbers in the wrong order).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Sep 11, 2018
When "git rebase -i" is told to squash two or more commits into
one, it labeled the log message for each commit with its number.
It correctly called the first one "1st commit", but the next one
was "commit #1", which was off-by-one.  This has been corrected.

* pw/rebase-i-squash-number-fix:
  rebase -i: fix numbering in squash message
derrickstolee pushed a commit that referenced this pull request Sep 17, 2018
serialize-status: serialize global and repo-local exclude file metadata
derrickstolee pushed a commit that referenced this pull request Sep 18, 2018
Change "fetch" to treat "+" in refspecs (aka --force) to mean we
should clobber a local tag of the same name.

This changes the long-standing behavior of "fetch" added in
853a369 ("[PATCH] Multi-head fetch.", 2005-08-20). Before this
change, all tag fetches effectively had --force enabled. See the
git-fetch-script code in fast_forward_local() with the comment:

    > Tags need not be pointing at commits so there is no way to
    > guarantee "fast-forward" anyway.

That commit and the rest of the history of "fetch" shows that the
"+" (--force) part of refpecs was only conceived for branch updates,
while tags have accepted any changes from upstream unconditionally and
clobbered the local tag object. Changing this behavior has been
discussed as early as 2011[1].

The current behavior doesn't make sense to me, it easily results in
local tags accidentally being clobbered. We could namespace our tags
per-remote and not locally populate refs/tags/*, but as with my
97716d2 ("fetch: add a --prune-tags option and fetch.pruneTags
config", 2018-02-09) it's easier to work around the current
implementation than to fix the root cause.

So this change implements suggestion #1 from Jeff's 2011 E-Mail[1],
"fetch" now only clobbers the tag if either "+" is provided as part of
the refspec, or if "--force" is provided on the command-line.

This also makes it nicely symmetrical with how "tag" itself works when
creating tags. I.e. we refuse to clobber any existing tags unless
"--force" is supplied. Now we can refuse all such clobbering, whether
it would happen by clobbering a local tag with "tag", or by fetching
it from the remote with "fetch".

Ref updates outside refs/{tags,heads/* are still still not symmetrical
with how "git push" works, as discussed in the recently changed
pull-fetch-param.txt documentation. This change brings the two
divergent behaviors more into line with one another. I don't think
there's any reason "fetch" couldn't fully converge with the behavior
used by "push", but that's a topic for another change.

One of the tests added in 31b808a ("clone --single: limit the fetch
refspec to fetched branch", 2012-09-20) is being changed to use
--force where a clone would clobber a tag. This changes nothing about
the existing behavior of the test.

1. https://public-inbox.org/git/20111123221658.GA22313@sigill.intra.peff.net/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Oct 8, 2018
check_one_conflict() compares `i` to `active_nr` in two places to avoid
buffer overruns, but left out an important third location.

The code did used to have a check here comparing i to active_nr, back
before commit fb70a06 ("rerere: fix an off-by-one non-bug",
2015-06-28), however the code at the time used an 'if' rather than a
'while' meaning back then that this loop could not have read past the
end of the array, making the check unnecessary and it was removed.
Unfortunately, in commit 5eda906 ("rerere: handle conflicts with
multiple stage #1 entries", 2015-07-24), the 'if' was changed to a
'while' and the check comparing i and active_nr was not re-instated,
leading to this problem.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Oct 30, 2018
Reimplement oidset using khash.h in order to reduce its memory footprint
and make it faster.

Performance of a command that mainly checks for duplicate objects using
an oidset, with master and Clang 6.0.1:

  $ cmd="./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'"

  $ /usr/bin/time $cmd >/dev/null
  0.22user 0.03system 0:00.25elapsed 99%CPU (0avgtext+0avgdata 48484maxresident)k
  0inputs+0outputs (0major+11204minor)pagefaults 0swaps

  $ hyperfine "$cmd"
  Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

    Time (mean ± σ):     250.0 ms ±   6.0 ms    [User: 225.9 ms, System: 23.6 ms]

    Range (min … max):   242.0 ms … 261.1 ms

And with this patch:

  $ /usr/bin/time $cmd >/dev/null
  0.14user 0.00system 0:00.15elapsed 100%CPU (0avgtext+0avgdata 41396maxresident)k
  0inputs+0outputs (0major+8318minor)pagefaults 0swaps

  $ hyperfine "$cmd"
  Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

    Time (mean ± σ):     151.9 ms ±   4.9 ms    [User: 130.5 ms, System: 21.2 ms]

    Range (min … max):   148.2 ms … 170.4 ms

Initial-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Rene Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Oct 30, 2018
Ever since the split index feature was introduced [1], refreshing a
split index is prone to a variant of the classic racy git problem.
There are a couple of unrelated tests in the test suite that
occasionally fail when run with 'GIT_TEST_SPLIT_INDEX=yes', but
't1700-split-index.sh', the only test script focusing solely on split
index, has never noticed this issue, because it only cares about how
the index is split under various circumstances and all the different
ways to turn the split index feature on and off.

Add a dedicated test script 't1701-racy-split-index.sh' to exercise
the split index feature in racy situations as well; kind of a
"t0010-racy-git.sh for split index" but with modern style (the tests
do everything in &&-chained list of commands in 'test_expect_...'
blocks, and use 'test_cmp' for more informative output on failure).

The tests cover the following sequences of index splitting, updating,
and racy file modifications, with the last two cases demonstrating the
racy split index problem:

  1. Split the index while adding a racily clean file:

       echo "cached content" >file
       git update-index --split-index --add file
       echo "dirty worktree" >file    # size stays the same

     This case already works properly.  Even though the cache entry's
     stat data matches with the modifid file in the worktree,
     subsequent git commands will notice that the (split) index and
     the file have the same mtime, and then will go on to check the
     file's content and notice its dirtiness.

  2. Add a racily clean file to an already split index:

       git update-index --split-index
       echo "cached content" >file
       git update-index --add file
       echo "dirty worktree" >file

     This case already works properly.  After the second 'git
     update-index' writes the newly added file's cache entry to the
     new split index, it basically works in the same way as case #1.

  3. Split the index when it (i.e. the not yet splitted index)
     contains a racily clean cache entry, i.e. an entry whose cached
     stat data matches with the corresponding file in the worktree and
     the cached mtime matches that of the index:

       echo "cached content" >file
       git update-index --add file
       echo "dirty worktree" >file
       # ... wait ...
       git update-index --split-index --add other-file

     This case already works properly.  The shared index is written by
     do_write_index(), i.e. the same function that is responsible for
     writing "regular" and split indexes as well.  This function
     cleverly notices the racily clean cache entry, and writes the
     entry to the new shared index with smudged stat data, i.e. file
     size set to 0.  When subsequent git commands read the index, they
     will notice that the smudged stat data doesn't match with the
     file in the worktree, and then go on to check the file's content
     and notice its dirtiness.

  4. Update the split index when it contains a racily clean cache
     entry:

       git update-index --split-index
       echo "cached content" >file
       git update-index --add file
       echo "dirty worktree" >file
       # ... wait ...
       git update-index --add other-file

     This case already works properly.  After the second 'git
     update-index' the newly added file's cache entry is only stored
     in the split index.  If a cache entry is present in the split
     index (even if it is a replacement of an outdated entry in the
     shared index), then it will always be included in the new split
     index on subsequent split index updates (until the file is
     removed or a new shared index is written), independently from
     whether the entry is racily clean or not.  When do_write_index()
     writes the new split index, it notices the racily clean cache
     entry, and smudges its stat date.  Subsequent git commands
     reading the index will notice the smudged stat data and then go
     on to check the file's content and notice its dirtiness.

  5. Update the split index when a racily clean cache entry is stored
     only in the shared index:

       echo "cached content" >file
       git update-index --split-index --add file
       echo "dirty worktree" >file
       # ... wait ...
       git update-index --add other-file

     This case fails due to the racy split index problem.  In the
     second 'git update-index' prepare_to_write_split_index() decides,
     among other things, which cache entries stored only in the shared
     index should be replaced in the new split index.  Alas, this
     function never looks out for racily clean cache entries, and
     since the file's stat data in the worktree hasn't changed since
     the shared index was written, the entry won't be replaced in the
     new split index.  Consequently, do_write_index() doesn't even get
     this racily clean cache entry, and can't smudge its stat data.
     Subsequent git commands will then see that the index has more
     recent mtime than the file and that the (not smudged) cached stat
     data still matches with the file in the worktree, and,
     ultimately, will erroneously consider the file clean.

  6. Update the split index after unpack_trees() copied a racily clean
     cache entry from the shared index:

       echo "cached content" >file
       git update-index --split-index --add file
       echo "dirty worktree" >file
       # ... wait ...
       git read-tree -m HEAD

     This case fails due to the racy split index problem.  This
     basically fails for the same reason as case #5 above, but there
     is one important difference, which warrants the dedicated test.
     While that second 'git update-index' in case #5 updates
     index_state in place, in this case 'git read-tree -m' calls
     unpack_trees(), which throws out the entire index, and constructs
     a new one from the (potentially updated) copies of the original's
     cache entries.  Consequently, when prepare_to_write_split_index()
     gets to work on this reconstructed index, it takes a different
     code path than in case #5 when deciding which cache entries in
     the shared index should be replaced.  The result is the same,
     though: the racily clean cache entry goes unnoticed, it isn't
     added to the split index with smudged stat data, and subsequent
     git commands will then erroneously consider the file clean.

Note that in the last two 'test_expect_failure' cases I omitted the
'#' (as in nr. of trial) from the tests' description on purpose for
now, as it breakes the TAP output [2]; it will be added at the end of
the series, when those two tests will be flipped to
'test_expect_success'.

[1] In the branch leading to the merge commit v2.1.0-rc0~45 (Merge
    branch 'nd/split-index', 2014-07-16).

[2] In the TAP output a '#' should separate the test's description
    from the TODO directive emitted by 'test_expect_failure'.  The
    additional '#' in "#$trial" interferes with this, the test harness
    won't recognize the TODO directive, and will report that those
    tests failed unexpectedly.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Nov 1, 2018
serialize-status: serialize global and repo-local exclude file metadata
derrickstolee pushed a commit that referenced this pull request Feb 19, 2020
The commit slab commit_rev_name contains a pointer to a struct rev_name,
and the actual struct is allocated separatly.  Avoid that allocation and
pointer indirection by storing the full struct in the commit slab.  Use
the tip_name member pointer to determine if the returned struct is
initialized.

Performance in the Linux repository measured with hyperfine before:

Benchmark #1: ./git -C ../linux/ name-rev --all
  Time (mean ± σ):     953.5 ms ±   6.3 ms    [User: 901.2 ms, System: 52.1 ms]
  Range (min … max):   945.2 ms … 968.5 ms    10 runs

... and with this patch:

Benchmark #1: ./git -C ../linux/ name-rev --all
  Time (mean ± σ):     851.0 ms ±   3.1 ms    [User: 807.4 ms, System: 43.6 ms]
  Range (min … max):   846.7 ms … 857.0 ms    10 runs

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Feb 19, 2020
We can calculate the size of new name easily and precisely. Open-code
the xstrfmt() calls and grow the buffers as needed before filling them.
This provides a surprisingly large benefit when working with the
Chromium repository; here are the numbers measured using hyperfine
before:

Benchmark #1: ./git -C ../chromium/src name-rev --all
  Time (mean ± σ):      5.822 s ±  0.013 s    [User: 5.304 s, System: 0.516 s]
  Range (min … max):    5.803 s …  5.837 s    10 runs

... and with this patch:

Benchmark #1: ./git -C ../chromium/src name-rev --all
  Time (mean ± σ):      1.527 s ±  0.003 s    [User: 1.015 s, System: 0.511 s]
  Range (min … max):    1.524 s …  1.535 s    10 runs

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Feb 19, 2020
Leave setting the tip_name member of struct rev_name to callers of
create_or_update_name().  This avoids allocations for names that are
rejected by that function.  Here's how this affects the runtime when
working with a fresh clone of Git's own repository; performance numbers
by hyperfine before:

Benchmark #1: ./git -C ../git-pristine/ name-rev --all
  Time (mean ± σ):     437.8 ms ±   4.0 ms    [User: 422.5 ms, System: 15.2 ms]
  Range (min … max):   432.8 ms … 446.3 ms    10 runs

... and with this patch:

Benchmark #1: ./git -C ../git-pristine/ name-rev --all
  Time (mean ± σ):     408.5 ms ±   1.4 ms    [User: 387.2 ms, System: 21.2 ms]
  Range (min … max):   407.1 ms … 411.7 ms    10 runs

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Feb 19, 2020
name_rev() assigns a name to a commit and its parents and grandparents
and so on.  Commits share their name string with their first parent,
which in turn does the same, recursively to the root.  That saves a lot
of allocations.  When a better name is found, the old name is replaced,
but its memory is not released.  That leakage can become significant.

Can we release these old strings exactly once even though they are
referenced multiple times?  Yes, indeed -- we can make use of the fact
that name_rev() visits the ancestors of a commit after it set a new name
for it and tries to update their names as well.

Members of the first ancestral line have the same taggerdate and
from_tag values, but a higher distance value than their child commit at
generation 0.  These are the only criteria used by is_better_name().
Lower distance values are considered better, so a name that is better
for a child will also be better for its parent and grandparent etc.

That means we can free(3) an inferior name at generation 0 and rely on
name_rev() to replace all references in ancestors as well.

If we do that then we need to stop using the string pointer alone to
distinguish new empty rev_name slots from initialized ones, though, as
it technically becomes invalid after the free(3) call -- even though its
value is still different from NULL.

We can check the generation value first, as empty slots will have it
initialized to 0, and for the actual generation 0 we'll set a new valid
name right after the create_or_update_name() call that releases the
string.

For the Chromium repo, releasing superceded names reduces the memory
footprint of name-rev --all significantly.  Here's the output of GNU
time before:

0.98user 0.48system 0:01.46elapsed 99%CPU (0avgtext+0avgdata 2601812maxresident)k
0inputs+0outputs (0major+571470minor)pagefaults 0swaps

... and with this patch:

1.01user 0.26system 0:01.28elapsed 100%CPU (0avgtext+0avgdata 1559196maxresident)k
0inputs+0outputs (0major+314370minor)pagefaults 0swaps

It also gets faster; hyperfine before:

Benchmark #1: ./git -C ../chromium/src name-rev --all
  Time (mean ± σ):      1.534 s ±  0.006 s    [User: 1.039 s, System: 0.494 s]
  Range (min … max):    1.522 s …  1.542 s    10 runs

... and with this patch:

Benchmark #1: ./git -C ../chromium/src name-rev --all
  Time (mean ± σ):      1.338 s ±  0.006 s    [User: 1.047 s, System: 0.291 s]
  Range (min … max):    1.327 s …  1.346 s    10 runs

For the Linux repo it doesn't pay off; memory usage only gets down from:

0.76user 0.03system 0:00.80elapsed 99%CPU (0avgtext+0avgdata 292848maxresident)k
0inputs+0outputs (0major+44579minor)pagefaults 0swaps

... to:

0.78user 0.03system 0:00.81elapsed 100%CPU (0avgtext+0avgdata 284696maxresident)k
0inputs+0outputs (0major+44892minor)pagefaults 0swaps

The runtime actually increases slightly from:

Benchmark #1: ./git -C ../linux/ name-rev --all
  Time (mean ± σ):     828.8 ms ±   5.0 ms    [User: 797.2 ms, System: 31.6 ms]
  Range (min … max):   824.1 ms … 838.9 ms    10 runs

... to:

Benchmark #1: ./git -C ../linux/ name-rev --all
  Time (mean ± σ):     847.6 ms ±   3.4 ms    [User: 807.9 ms, System: 39.6 ms]
  Range (min … max):   843.4 ms … 854.3 ms    10 runs

Why is that?  In the Chromium repo, ca. 44000 free(3) calls in
create_or_update_name() release almost 1GB, while in the Linux repo
240000+ calls release a bit more than 5MB, so the average discarded
name is ca.  1000x longer in the latter.

Overall I think it's the right tradeoff to make, as it helps curb the
memory usage in repositories with big discarded names, and the added
overhead is small.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Feb 19, 2020
name_ref() is called for each ref and checks if its a better name for
the referenced commit.  If that's the case it remembers it and checks if
a name based on it is better for its ancestors as well.  This in done in
the the order for_each_ref() imposes on us.

That might not be optimal.  If bad names happen to be encountered first
(as defined by is_better_name()), names derived from them may spread to
a lot of commits, only to be replaced by better names later.  Setting
better names first can avoid that.

is_better_name() prefers tags, short distances and old references.  The
distance is a measure that we need to calculate for each candidate
commit, but the other two properties are not dependent on the
relationships of commits.  Sorting the refs by them should yield better
performance than the essentially random order we currently use.

And applying older references first should also help to reduce rework
due to the fact that older commits have less ancestors than newer ones.

So add all details of names to the tip table first, then sort them
to prefer tags and older references and then apply them in this order.
Here's the performance as measures by hyperfine for the Linux repo
before:

Benchmark #1: ./git -C ../linux/ name-rev --all
  Time (mean ± σ):     851.1 ms ±   4.5 ms    [User: 806.7 ms, System: 44.4 ms]
  Range (min … max):   845.9 ms … 859.5 ms    10 runs

... and with this patch:

Benchmark #1: ./git -C ../linux/ name-rev --all
  Time (mean ± σ):     736.2 ms ±   8.7 ms    [User: 688.4 ms, System: 47.5 ms]
  Range (min … max):   726.0 ms … 755.2 ms    10 runs

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Mar 24, 2020
serialize-status: serialize global and repo-local exclude file metadata
derrickstolee pushed a commit that referenced this pull request Jun 2, 2020
Includes these pull requests:

	#1
	#6
	#10
	#11
	git#157
	git#212
	git#260
	git#270

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee pushed a commit that referenced this pull request Jun 5, 2020
While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks.  While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.

Note that this necessitates the change of the error message as well,
and, consequently, have to update the 'detect incorrect chunk count'
test in 't5318-commit-graph.sh' as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
derrickstolee pushed a commit that referenced this pull request Jun 5, 2020
In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all those three blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculate the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
derrickstolee pushed a commit that referenced this pull request Jun 5, 2020
During a pathspec-limited revision walk, e.g. 'git log -- dir/file',
Git spends a significant part of its runtime in the tree-diff
machinery, checking whether the given path was modified between
subsequent commits.  By using Bloom filters to store the paths
modified by each commit we can quickly tell that a commit didn't
modify the given path without invoking tree-diff, thus considerably
reduce this diffing overhead, along with the runtime and memory usage.
This patch extends the commit-graph file format with additional chunks
to store those modified path Bloom filters.

The rest of this log message takes a closer look at the problem and
explains why these new chunks look like they do.

In the following the terms "file" and "path" are not used
interchangeably.  If a commit modified 'dir/subdir/foo.c', then it
modified that one file, but it modified three paths (namely 'dir',
'dir/subdir', and 'dir/subdir/foo.c').

Furthermore, unless otherwise noted, "a lot of paths" means 5000
randomly selected paths that have existed at one time or another
during the history of the corresponding repository's default branch
[1], except for git it means all 5089 paths that have ever existed
during the history of v2.25.0, and for homebrew-cask and homebrew-core
all 7307 and 6535 paths, respectively, that have ever existed in the
history of their default branches some time ago.

About pathspec-limited revision walks
-------------------------------------

So, during a pathspec-limited revision walk Git spends a significant
part of its runtime in the tree-diff machinery, checking whether the
given path was modified between subsequent commits.  The table below
shows the average runtime of 'git rev-list HEAD -- $path' for "a lot
of paths" in a couple of repositories, and how much of that runtime is
spent in diff_tree_oid().  It also shows the potential average
speedup, should we be able to reduce the tree-diff overhead to zero
without introducing some other overhead (spoiler alert: we won't be
able to achieve that, of course, but still will achieve even higher
average speedup in several cases).

                 Average    Average            Potential
                 runtime   diff time            speedup
  -------------------------------------------------------
  android-base   0.8780s    0.7320s    83.37%     6.14x
  cmssw          0.3143s    0.2800s    88.95%     9.19x
  cpython        0.7453s    0.6602s    88.58%     8.76x
  elasticsearch  0.1492s    0.1351s    90.55%    10.58x
  gcc            7.1852s    6.9432s    96.63%    29.69x
  gecko-dev      4.6113s    4.0964s    88.83%     8.96x
  git            0.6180s    0.5911s    95.65%    22.97x
  glibc          0.5618s    0.5313s    94.57%    18.42x
  go             0.4913s    0.4469s    90.96%    11.07x
  jdk            0.0482s    0.0431s    89.42%     9.45x
  linux          0.7043s    0.6163s    87.50%     8.00x
  llvm-project   2.6844s    2.2607s    84.22%     6.34x
  rails          0.2784s    0.2372s    85.20%     6.76x
  rust           0.7757s    0.7349s    94.74%    19.01x
  tensorflow     0.6258s    0.5735s    91.64%    11.96x
  webkit         1.9137s    1.6580s    86.64%     7.48x

Notice that the average time spent diffing in the Linux and Git
repositories is quite close (0.6163s vs 0.5911s), although the Linux
repository contains about 15x more commits and 25x more files; more on
this later.

Instrumenting the tree-diff machinery and gathering some stats about
the number of diff calls, tree object comparisons and decoded tree
entries revealed some interesting, perhaps even counterintuitive
things:

  - The number of scanned tree entries can be more important than the
    number of tree comparisons.

    Here are four paths from two repositories, two each in the same
    frequently-modified directory, with one near the beginning and one
    at the end of that directory.  When listing the commits modifying
    these paths, i.e. 'git rev-list HEAD -- $path', the number of tree
    comparisons is (nearly) the same, but the number of scanned tree
    entries differs by over two orders of magnitude.

                                      Average   Nr of scanned   Nr of tree
      Repository          $path      diff time   tree entries  comparisons
      --------------------------------------------------------------------
      git            .clang-format     0.188s          40115       18055
      git            zlib.c            0.729s       10705983       17120
      homebrew-core  Formula/a2ps.rb   8.390s        2122937      326758
      homebrew-core  Formula/zzz.rb   80.495s     1235547836      326758

    This is also noticeable when looking at the average number of
    scanned tree entries and tree comparisons when running 'git
    rev-list HEAD -- $path' for "a lot of paths":

                                Average nr      Average      Average
                     Average    of scanned     nr of tree   nr of diff
                    diff time  tree entries   comparisons      calls
      ----------------------------------------------------------------
      jdk            0.0431s       204466         5520         3702
      elasticsearch  0.1351s       720393        17212        13457
      rails          0.2372s      1299221        43369        33670
      cmssw          0.2800s      1938067        24798        23739
      go             0.4469s      3207155        76432        42346
      glibc          0.5313s      6555231        40442        29208
      tensorflow     0.5735s      3914851        97227        47262
      git            0.5911s      7480371        21789        18168
      linux          0.6163s      4004067        79514        58145
      cpython        0.6602s      4695754        88408        70439
      android-base   0.7320s      4312043       122280        96376
      rust           0.7349s      5084683        68110        29847
      webkit         1.6580s      9395349       303105       218906
      llvm-project   2.2607s     11773792       469801       330250
      gecko-dev      4.0964s     42396246       415843       387162
      gcc            6.9432s     84853083       286904       174218

    Note that although we have a lot less tree comparisons in the git
    than in the linux repository, it has almost double the amount of
    scanned tree entries, because the git repository has some
    frequently modified biggish directories (notably its root dir, 't'
    or 'Documentation').  As a result the average time spent diffing
    in the two repositories is roughly the same, although the linux
    repository is much larger.

    Or that we spend the most time diffing in the gcc repository,
    because it has by far the most scanned tree entries, even though
    it has less tree comparisons than any of the next three slowest
    repositories.

  - The number of path components in the pathspec, i.e. its depth in
    the directory tree seems to be irrelevant.

    When checking whether a path somewhere deep in the directory tree
    has been modified between a commit and its parent the tree-diff
    machinery can short-circuit, and it returns as soon as it finds
    the first leading directory that hasn't been modified.  And more
    often than not it can short-circuit already while comparing the
    root trees, as all the <1.5 values in the "Average tree
    comparisons per diff call" column show.

                     Average   Average tree     Average      Average
                    pathspec    comparisons    nr of diff   nr of tree
                      depth    per diff call      calls    comparisons
      ----------------------------------------------------------------
      android-base    5.78         1.26          96376       122280
      cmssw           4.28         1.04          23739        24798
      cpython         3.72         1.25          70439        88408
      elasticsearch   9.13         1.28          13457        17212
      gcc             5.13         1.64         174218       286904
      gecko-dev       6.25         1.07         387162       415843
      git             2.30         1.19          18168        21789
      glibc           4.17         1.38          29208        40442
      go              4.60         1.80          42346        76432
      jdk             8.45         1.49           3702         5520
      linux           4.49         1.36          58145        79514
      llvm-project    5.45         1.42         330250       469801
      rails           5.43         1.28          33670        43369
      rust            4.70         2.28          29847        68110
      tensorflow      5.56         2.06          47262        97227
      webkit          6.33         1.38         218906       303105

    Note the 2.xy average tree comparisons per diff call in the rust
    and tensorflow repositories.  In the rust repository over 98% of
    the paths are in the 'src' directory and over 93% of commits
    modify a file under that directory, while in the tensorflow
    repository over 92% of paths are in the 'tensorflow' directory and
    over 90% of commits modify a file under that directory.
    Consequently, the tree-diff machinery can only rarely
    short-circuit while comparing the root trees of two subsequent
    commits, but has to dive down and compare the contents of the
    'src' or 'tensorflow' directories most of the time as well.  I
    suspect that we would get a similar ~2 value in the homebrew-core
    repository as well, because over 99.7% of all commits modify the
    'Formula' directory, which contains over 90% of paths in the
    repository.

Bloom filters intro
-------------------

Quoting the (quite good) Wikipedia article, "A Bloom filter is a
space-efficient probabilistic data structure [...] that is used to
test whether an element is a member of a set. False positive matches
are possible, but false negatives are not – in other words, a query
returns either 'possibly in set' or 'definitely not in set'".

A Bloom filter is a bit array, initially all bits set to 0.  To add an
element to a Bloom filter, the element is hashed using 'k' independent
hash functions, the resulting hashes are turned into 'k' array
positions, and the bits at those positions in the filter are set to 1.
To query for an element, the element is hashed using the same 'k' hash
functions, the resulting hashes are turned into 'k' array positions,
and the values of the bits at those positions are checked.  If all
those bits are set, then the element is 'possibly in set'.  If even
one of those bits is unset, then the element is 'definitely not in
set'.

Some of the for us relevant properties of Bloom filters are:

  - A Bloom filter doesn't store the elements themselves, it only sets
    a couple of bits based on the elements' hashes.  Consequently, a
    Bloom filter can't tell which elements it contains; it can't even
    tell the number of elements in there.

  - A bit in the Bloom filter's bit array might be set by more than
    one element.  This is where the probabilistic nature and the
    possibility of false positives come from: it is possible that some
    elements in the set happen to have set all 'k' bits that would
    indicate the presence of a particular element, even though that
    element itself is not part of the set.

  - Elements can't be removed from a Bloom filter: the 'k' bits
    indicating the presence of an element can't simply be unset,
    because that might unset bits that were set by other elements as
    well.
    There are enhanced Bloom filter variants that allow removal of
    elements, like a counting Bloom filter, but they come with
    additional complexity and require considerably more space.

  - Bloom filters can't be resized, because turning the hashes into
    positions in the bit array critically depends on the size of the
    bit array (usually pos[i] = hash[i] % bit_array_size).

  - Bloom filters of the same size can be bitwise OR-ed to create the
    union of the sets of elements in the filters.

  - When playing long enough with probabilities and making some
    reasonable assumptions and approximations it can be shown (see the
    Wikipedia article) that for a desired false positive probability
    'p' there is an optimal number of 'k' hash functions:

      k = -logâ‚‚(p)    (that's a base 2 logarithm there)

    For 1% false positive probability k = 6.64, but of course there
    can only be an integral number of hash functions, so 7 it is.

    Note that this value is independent both from the size of the
    Bloom filter and from the number of elements in there.

    Inversely, when using a properly sized bit array, then the false
    positive probability falls exponentially as 'k' increases.

  - To store 'n' elements in a Bloom filter with a desired false
    positive probability using the optimal number of 'k' bits/hash
    functions, we need a bit array with approximate size 'm':

      m ≈ n * k / ln(2) ≈ n * k * 10 / 7

    When using 7 hash functions to aim for < 1% false positive
    probability this simplifies down to:

      m ≈ n * 7 * 10 / 7 = 10 * n

    i.e. approx. 10 bits per element.

  - In general the more elements, IOW the more set bits there are in a
    Bloom filter of certain size, the higher the probability of false
    positives.  Similarly, the larger the size of a Bloom filter's bit
    array for a certain number of elements, the lower the probability
    of false positives.

  - A Bloom filter with all bits set appears to contain "everything",
    because all queries will return 'possibly in set'.

Modified Path Bloom Filters
---------------------------

We'll use Bloom filters to store modified paths and will store them in
the Modified Path Bloom Filters chunk.  Yes, plural: we'll use a lot
of small Bloom filters, each one storing all paths modified by one
commit compared to one of its parents (see the Alternatives section
near the end for reasons for not using one big Bloom filter).  Then as
the pathspec-limited revision walk iterates over all commits it will
query each modified path Bloom filter to see whether the given
pathspec is in there, and if it is 'definitely not in set', then we
can spare a tree-diff call and skip the commit right away, but if it's
'possibly in set', then we must do the tree-diff to make sure.

Each modified path Bloom filter consists of:

  - 4 bytes specifying the number of bits in the Bloom filter's bit
    array.

    For practical purposes 32 bit is more than sufficient to store the
    number of bits in the Bloom filter's array.  When using k = 7
    hashes, i.e. 10 bits per path, then we could store over 400
    million paths in a single Bloom filter; with k = 32 hashes we'd
    use 46 bit per path, which is still over 93 million paths.

  - The actual bit array.  See the next and the Hashing scheme
    sections for details about how this bit array is used.

All modified path Bloom filters will use the same k number of hashes
per path, so we won't have to store that in each Bloom filter.  I
suspect that having modified path Bloom filters with different k
number of hashes wouldn't bring enough benefit to justify storing that
value in each and every filter.

The order of modified path Bloom filters in this chunk is unspecified
on purpose, so implementations can experiment with writing them in
history or topological order, which may bring performance benefits
through better locality.

[Somewhere in the middle of this section the length of this commit
message surpassed the length of 72441af7c4 (tree-diff: rework
diff_tree() to generate diffs for multiparent cases as well,
2014-04-07), the mother of all commit messages from the great Kirill
Smelkov, yay! :)]

Modified Path Bloom Filter Index
--------------------------------

Since modified path Bloom filters vary in size, we can't have an array
of modified path Bloom filters indexed by the position of the commit
OID in the OID Lookup chunk, but we'll need a level of indirection
between the commit OID position and the position of the associated
modified path Bloom filter.

So the Modified Path Bloom Filter Index chunk will contain an array of
offsets to quickly map commit OID positions to offsets in the Modified
Path Bloom Filters chunk, i.e. its Nth entry contains the offset for
the commit whose OID is Nth in the OID Lookup chunk.

Since the commit-graph file can store information about 2^31-1
commits, a merely 4 byte offset per commit is clearly insufficient.
However, surely noone want to have lots of gigabyte-sized modified
path Bloom filters, so a standard 8 byte file offset will be
underutilized.  This allows us to use a few bits from those 8 bytes
for special purposes.  For now we'll have two special purpose bits:

  - The most significant bit indicates that the entry is not an offset
    into the Modified Path Bloom Filters chunk, but an "embedded"
    modified path Bloom filter containing all paths modified by the
    commit compared to its first parent; see the next section.

  - The second most significant bit indicates that modified path Bloom
    filters are stored not only for the first parent of a merge
    commit, but for all its parents, and the entry is neither an
    offset nor an "embedded" modified path Bloom filter, but an array
    index into the Modified Path Bloom Filter Merge Index chunk; see
    the "Merges" section below.

  - [TODO: perhaps we might want to reserve one or two more bits for
    special purposes?]

Make these offsets relative to the start of the Modified Path Bloom
Filters chunk, so they will depend neither on the number of chunks nor
on the combined size of any other chunks that are written before the
Modified Path Bloom Filters chunk, thus allowing implementations to
calculate all these offsets without knowing anything about those other
chunks.  Furthermore, if the offsets were relative to the beginning of
the file, then some huge chunks could make the offsets grow too large,
and would mess up those special purpose bits (though, arguably, an
exabyte sized commit-graph file is just too unlikely to be worried
about...).

An "all bits set" index entry can be used to indicate that there is no
modified path Bloom filter stored for the corresponding commit.  A
reader implementation can either special-case such an entry, or
interpret it as an embedded modified path Bloom filter that replies
'possibly in set' to all queries, the result is the same: it has to
resort to running tree-diff for that commit.

These embedded modified path Bloom filters have implications on the
low-level Bloom filter format in the Modified Path Bloom Filters chunk
as well, namely:

  - Their array contains only 63 bits, not 64, i.e. not 8 full bytes.
    Therefore, for simplicity, we'll store the size of the bit array
    as the number of bits, not bytes.

  - Assigning special purpose to the most significant bits of the
    index entries is convenient when the entry is an offset into the
    Modified Path Bloom Filters chunk.  OTOH, it makes indexing into
    the Bloom filter's array awkward if we try to treat it like an
    ordinary array, i.e. whose 0th element comes first, because we'd
    have to account for the special purpose bits.  Therefore, we'll
    store the bit array kind of in big endian byte order, i.e. the
    most significant byte first and the 0th byte last.

In addition, this chunk will start with a small header storing a bit
of metadata that applies to all modified path Bloom filters in all
related chunks.  The reason for storing this header in the Modified
Path Bloom Filter Index chunk is that only this chunk is essential to
use modified path Bloom filters (if all commits modify so few files
that their modified path Bloom filters can all be embedded into the
index chunk, then the Modified Path Bloom Filters chunk will contain
nothing and thus can be omitted; e.g. the two homebrew repositories
come quite close to this, as shown below).

This header contains:

  - One byte storing the number of 'k' hashes per path.

    Using a single byte is more than sufficient: as 'k' increases the
    false positive probability falls exponentially, and quickly
    becomes unnecessarily low (i.e. with k = 31 even a full
    commit-graph containing 2^31-1 commits is expected to have only a
    single false positive).

  - [TODO: What else?  We might want to have a version byte, though if
    a format change becomes necessary, then we could simply rename the
    chunks just as well...]

Embedded Modified Path Bloom Filters
------------------------------------

The ideal size of a Bloom filter to store a set of 6 or 7 elements
with 7 hash functions is approximately 60 or 70 bits, respectively.
So in the space of a 8 byte Modified Path Bloom Filter Index entry we
could comfortably store 6 entries by using one bit to indicate that
the entry is not a file offset but an "embedded" modified path Bloom
filter and the remaining 63 bits as the Bloom filter's bit array.

As the table below shows, a significant portion or even the vast
majority of commits modify no more than 6 paths, so we can embed a lot
of modified path Bloom filters into the Modified Path Bloom Filter
Index chunk.  (Also note the unusually large percentage of empty diffs
in the android-base and cpython repositories.)

        Percentage of commits modifying <=N (or =N) paths compared to their first parents
                   0      <=1      <=2      <=3      <=4      <=5      <=6      <=7      <=8
                          (=1)     (=2)     (=3)     (=4)     (=5)     (=6)     (=7)     (=8)
  --------------------------------------------------------------------------------------------
  elasticsearch  0.70%   4.00%    5.67%    8.50%   13.17%   16.55%   18.32%   21.18%   27.47%
                        (3.30%)  (1.67%)  (2.83%)  (4.67%)  (3.37%)  (1.77%)  (2.86%)  (6.29%)
  jdk            0.26%   3.47%   10.60%   13.99%   15.62%   19.02%   26.62%   34.30%   40.57%
                        (3.20%)  (7.14%)  (3.39%)  (1.63%)  (3.40%)  (7.60%)  (7.68%)  (6.27%)
  webkit         0.05%   0.07%    0.77%    2.14%    9.15%   26.66%   38.42%   47.34%   52.83%
                        (0.02%)  (0.71%)  (1.37%)  (7.01%) (17.51%) (11.76%)  (8.92%)  (5.49%)
  android-base  13.20%  13.62%   14.23%   18.55%   20.91%   35.18%   42.32%   50.82%   62.05%
                        (0.42%)  (0.62%)  (4.32%)  (2.36%) (14.28%)  (7.14%)  (8.49%) (11.23%)
  llvm-project   0.12%   0.12%    0.94%    6.45%   25.24%   46.68%   53.60%   60.97%   67.33%
                        (0.00%)  (0.81%)  (5.51%) (18.79%) (21.44%)  (6.92%)  (7.37%)  (6.36%)
  gecko-dev      0.14%   0.96%    1.88%   15.44%   32.42%   46.12%   54.54%   61.37%   66.65%
                        (0.82%)  (0.92%) (13.56%) (16.98%) (13.70%)  (8.42%)  (6.82%)  (5.28%)
  tensorflow     0.09%   1.26%    2.72%    5.00%   26.30%   42.36%   55.17%   63.11%   69.70%
                        (1.17%)  (1.46%)  (2.28%) (21.27%) (16.07%) (12.81%)  (7.94%)  (6.59%)
  rails          0.10%   2.09%    5.79%   16.03%   35.57%   51.47%   58.71%   65.15%   72.96%
                        (1.99%)  (3.70%) (10.23%) (19.54%) (15.90%)  (7.24%)  (6.44%)  (7.82%)
  rust           0.07%   2.20%    5.11%   22.81%   42.35%   52.50%   59.29%   65.33%   70.02%
                        (2.13%)  (2.91%) (17.70%) (19.54%) (10.15%)  (6.79%)  (6.04%)  (4.69%)
  glibc          0.02%   7.33%   14.03%   30.86%   42.22%   52.53%   61.59%   68.50%   73.68%
                        (7.31%)  (6.70%) (16.83%) (11.36%) (10.32%)  (9.06%)  (6.91%)  (5.19%)
  gcc            0.00%   0.24%   10.92%   26.61%   39.85%   54.97%   63.80%   69.37%   76.52%
                        (0.24%) (10.68%) (15.69%) (13.24%) (15.13%)  (8.82%)  (5.57%)  (7.14%)
  go             0.00%   0.96%    9.09%   19.35%   39.97%   53.16%   65.31%   72.10%   77.36%
                        (0.95%)  (8.13%) (10.26%) (20.63%) (13.19%) (12.15%)  (6.79%)  (5.26%)
  cmssw          0.15%   0.19%    0.20%    2.43%   45.35%   56.49%   67.58%   73.17%   77.99%
                        (0.03%)  (0.01%)  (2.23%) (42.92%) (11.15%) (11.08%)  (5.59%)  (4.83%)
  linux          0.01%   0.66%    3.97%   23.49%   46.15%   62.97%   72.79%   79.16%   83.57%
                        (0.65%)  (3.30%) (19.52%) (22.66%) (16.82%)  (9.82%)  (6.37%)  (4.41%)
  cpython        3.07%   4.97%   27.73%   59.54%   70.34%   77.48%   81.91%   86.76%   89.34%
                        (1.91%) (22.75%) (31.82%) (10.80%)  (7.13%)  (4.43%)  (4.85%)  (2.59%)
  git            0.11%  27.54%   55.92%   73.79%   81.90%   87.05%   90.28%   92.29%   93.82%
                       (27.43%) (28.38%) (17.87%)  (8.11%)  (5.15%)  (3.23%)  (2.01%)  (1.53%)
  homebrew-cask  0.40%   0.94%   95.41%   97.42%   98.11%   98.40%   98.61%   98.79%   98.93%
                        (0.54%) (94.46%)  (2.01%)  (0.70%)  (0.29%)  (0.21%)  (0.18%)  (0.14%)
  homebrew-core  0.01%   0.07%   98.81%   99.35%   99.56%   99.75%   99.81%   99.84%   99.86%
                        (0.07%) (98.74%)  (0.53%)  (0.22%)  (0.19%)  (0.06%)  (0.03%)  (0.02%)

This saves space, because the Modified Path Bloom Filters chunk will
contain a lot less Bloom filters (albeit those would be rather small
filters).

It reduces the probability of false positives, because all commits
modifying 1-6 paths will have larger than strictly necessary modified
path Bloom filters.

Finally, it makes the Bloom filter query ever so slightly faster,
partly because there is no redirection into the Modified Path Bloom
Filters chunk, and partly because we can check all bit positions in a
63 bit modified path Bloom filter using a 64 bit mask at once, instead
of checking those bit positions one by one, and we can use the same
mask to check all embedded Bloom filters.

The number of paths that can be stored in a 63 bit Bloom filter
depending on the number of hashes per path:

  k     |  3   |  4   |  5  |  6  |  7  |  8  | 9 - 11 | 12 - 14 | 15 - 22 | 23 - 44
  ------+------+------+-----+-----+-----+-----+--------+---------+---------+--------
  paths | <=14 | <=11 | <=8 | <=7 | <=6 | <=5 |  <=4   |   <=3   |   <=2   |   <=1

Hashing scheme
--------------

We need to map each modified path to 'k' independent bit positions in
the Bloom filters bit array.  We want the hash function and hashing
scheme that results in the lowest false positive rate and has the
lowest probability of "colliding" paths (see below), but it should
still be fast to compute, and should be widely available for
alternative Git implementations.

At this point we don't care about the runtime of pathspec-limited
revision walks here.  Lower false positive rate inherently leads to
lower runtime, though we are reaching diminishing returns on common
setups as the false positive rate gets lower and lower... and e.g. in
the webkit repository we'll reach an average false positive rate of
~0.001% with only k = 7 hashes per path.  However, on unusual
setups/configurations accessing tree objects might be considerably
more expensive than accessing commit objects, especially when using
only commit info stored in the commit-graph.  E.g. consider a future
where we can distribute commit-graphs with modified path Bloom filters
to partial clones containing only commit objects for most of the
history: any tree-diff will be really expensive, because the tree
objects must be fetched from the promisor.  In such a setup every
avoidable tree-diff call counts, and low false positive rate is
king.

For now I went with 32 bit MurmurHash3 used in enhanced double hashing
scheme with 32 bit unsigned integer arithmetic, though as I will show
below it seems that this is not the best option.

So, to map each modified path to 'k' bit positions in the Bloom
filter's array we first need 'k' independent hashes.  In general,
hashing a path 'k' times with the same hash function but using 'k'
different seeds produces hashes that are independent enough.  In
practice, to reduce the overhead of hashing, especially for larger 'k'
values, some variant of double hashing is often used to generate the
'k' independent-ish hashes from the results of only two hash function
calls with different seeds.  In general:

  h1 = h(seed1, path)
  h2 = h(seed2, path)
  for (i = 0; i < k; i++)
      pos[i] = (h1 + i * h2 + f(i)) % nr_bits

Depending on how the f(i) term is defined there are a few named
variants:

  - Double hashing: f(1) = 0:

      pos[i] = (h1 + i * h2) % nr_bits

  - Improved double hashing: f(i) adds a simple quadratic term:

      pos[i] = (h1 + i * h2 + i^2) % nr_bits

  - Enhanced double hashing: f(i) adds a not-so-simple cubic term:

      pos[i] = (h1 + i * h2 + (i^3 - i) / 6) % nr_bits

    This cubic term is equal to the following sequence of numbers,
    starting from i = 0:

      0, 0, 1, 4, 10, 20, 35, 56, 84, 120, etc.

    These are almost the same as tetrahedral numbers, except that the
    mathematical definition starts with 1, 4, 10..., while the OEIS
    A000292 sequence starts with 0, 1, 4, 10...

    I'm puzzled by this term being 0 not only for i = 0, but for i = 1
    as well, because this means that if (h2 % nr_bits == 0) (see
    below), then both pos[0] = h1 and pos[1] = h1, IOW in that case
    the path has one less bit set.  We'll take a look at whether
    starting the sequence with a single 0 makes a difference (it does)
    and how that affects the false positive rate (slightly increases
    it overall), and this will be labeled "enhanced double hashing
    variant #1" below.

    This sequence is supposed to be just as good as a simple i^3 term,
    but this can easily be calculated incrementally without
    multiplication, using only addition.  We'll take a look at how a
    simple cubic term affects the false positive rate (it increases
    it), and this will be labeled "enhanced double hashing variant
    #2" below.

These hashing schemes usually work fairly well in practice, but in
practice Bloom filters are primarily used to check whether an element
is part of _one_ _big_ set, and, consequently, most of the wisdom and
experience out there applies to big Bloom filters.

We, however, will repeatedly check whether a particular element (i.e.
path) is part of a large number of mostly very small sets, and our use
case does challenge those best practices.

In our use case it is important when two paths "collide", i.e. when
one path in itself sets all the bit positions that (falsely) indicate
the presence of another path in a modified path Bloom filter, so let's
have a look at that.  And let's hope against hope that I don't mess up
the math...

So, if we have k truly independent hashes for each path, then:

  (1) The probability of two paths mapping to the same k separate bit
      positions in a Bloom filter of size nr_bits is nr_bits! /
      ((nr_bits - k)! * k!).

  (2) The probability of a path setting only a single bit position is
      1 / nr_bits^(k-1), and the probability of a path setting one
      particular bit position is k / nr_bits (assuming that it does
      set k separate bits), so the probability of a path setting a bit
      position that happens to be the only bit position set by an
      other path is k / nr_bits^k.

In case of our 63 bit embedded modified path Bloom filters and k = 7
the probabilities of these two cases are about 1.81 * 10^(-9) and
1.77 * 10^(-12), respectively.  I think that these are low enough not
to worry about.

However, the hashes in the various double hashing schemes are far from
being independent.  When starting out with unsigned 32 bit hashes
(like we'll do) and using 64 bit arithmetic, then there is no integer
overflow, because neither i nor f(i) are large enough for that.  Then,
due to the distributivity of the modulo operation (and because i <
nr_bits), all double hashing schemes are equivalent to:

  pos[i] = (h1 % nr_bits + i * (h2 % nr_bits) + f(i) % nr_bits) % nr_bits

For the above mentioned two colliding cases this means:

  (1) if (h(seed1, "foo") % nr_bits == h(seed1, "bar") % nr_bits &&
          h(seed2, "foo") % nr_bits == h(seed2, "bar") % nr_bits)

      then all double hashing variants will work as if both paths were
      hashed to the same h1 and h2 values, and, consequently, both
      paths will be mapped to the same bit positions.  This has the
      probability of 1 / nr_bits^2.

  (2) if (h2 % nr_bits == 0)

      then the i * h2 term will basically amount to nothing and this
      term can be simplified away, and all bit positions will depend
      only on h1; this has the probability of 1 / nr_bits.

      In case of double hashing the f(i) term is 0 as well, meaning
      that the path maps to a single bit position.  The probability of
      a path setting a bit position that happens to be the only bit
      position set by an other path is k / nr_bits^2.

      The quadratic or cubic f(i) terms in improved or enhanced double
      hashing ensure that a path is mapped to multiple bit positions
      even in this case, though those bit positions are not nearly as
      random as one would like.

In case of 63 bit Bloom filters and k = 7, the probabilities of these
two cases are 1 / 3969 and 1 / 567, respectively.

When using 32 bit unsigned integer arithmetic, then an integer
overflow is definitely possible, so the double hashing formula
becomes:

  pos[i] = (h1 + i * h2 + f(i)) % 2^32 % nr_bits

If nr_bits is a power of two, then the "% 2^32" term can be simplified
away, and we end up with the same formula as with 64 bit arithmetic,
and the probabilities of cases (1) and (2) above remain the same.

If nr_bits is not a power of two, then...  well, I don't offhand know
how to approach that formally :)  Anyway:

  (1) This type of collision seems to occur if that two-liner
      condition above is true and for both paths there is an overflow
      for the same values of i, which has the approximate probability
      of 1 / ((k - 1) * nr_bits^2).

  (2) This type of collision seems to occur if (h2 % nr_bits == 0) and
      there is either no integer overflow for any values of i or if
      there is an overflow for all values of i, which has the
      approximate probability of k / ((k - 1) * nr_bits^2).

In case of 63 bit Bloom filters and k = 7 the probabilities of these
two cases are 1 / 23814 and 1 / 3402, respectively.

There are several other colliding cases, e.g. with double hashing
variants its more probable that a path maps only to 2, 3, etc. bit
positions instead of 'k' than with 'k' truly independent hashes,
though I haven't looked into how the probabilities work out.

Anyway, all the above already shows that:

  - These colliding cases are several orders of magnitude more likely
    with any double hashing variant than with k truly independent hash
    functions.

  - When using some sort of double hashing, then these colliding cases
    can happen with a high enough probability that we can't just
    ignore them.

  - These colliding cases can happen much more frequently with double
    hashing than with improved or enhanced double hashing.

  - Using 64 vs 32 bit arithmetic while calculating various double
    hashing schemes makes a difference, and suggests that 32 bit
    arithmetic has lower false positives probability.

  - Bloom filters whose size is a power of two might have higher false
    positive probability.

All this is important, because there are repositories out there that
modify the same path in the majority of commits, e.g.:

  - homebrew-core: contains 6535 paths, and over 99.5% of commits
    modify the 'Formula' directory and have an embedded modified path
    Bloom filter.

  - homebrew-cask: contains 7307 paths, and over 95.5% of commits
    modify the 'Casks' directory and have an embedded modified path
    Bloom filter.

  - rust: contains 58k+ paths, and almost 93% of commits modify the
    'src' directory, though only over 54% of commits modify that
    directory and have an embedded modified path Bloom filter.

  - tensorflow: contains 47k+ paths, and almost 91% of commits modify
    the 'tensorflow' directory, though only about 51% of commits
    modify that directory and have an embedded modified path Bloom
    filter.

  - go: contains 22k+ paths, and almost 84% of commits modify the
    'src' directory, though only over 50% of commits modify that
    directory and have an embedded modified path Bloom filter.

So e.g. if we were to look for commits modifying a path in the
homebrew-core repository, which happens to map to the same bit
positions in a 63 bit Bloom filter as the 'Formula' directory, then
Boom! we would get over 99.5% false positive rate, effectively
rendering modified path Bloom filters useless for that particular
path.

The table below shows the number of paths that happen to collide with
the repository's frequently modified directory in embedded modified
path Bloom filters using different hash functions and hashing schemes
with 32 bit unsigned integer arithmetic and 64 bit arithmetic (the
latter in parentheses):

                |    Double hashing   |      Enhanced     |      7 seeds
                |                     |   double hashing  |
                |  Murmur3 |  xxHash  | Murmur3 |  xxHash | Murmur3 | xxHash
  --------------+----------+----------+---------+---------+---------+--------
  homebrew-core |  4  (15) |  5  (17) |  0  (0) |  0  (2) |    0    |    0
  homebrew-cask |  2  (21) |  6  (14) |  0  (2) |  1  (1) |    0    |    0
  rust          | 18 (143) | 38 (144) |  1 (15) |  6 (24) |    0    |    0
  tensorflow    | 20 (110) | 18 (105) |  4 (15) |  0 (12) |    0    |    0
  go            |  9  (66) | 12  (62) |  0  (1) |  3  (8) |    0    |    0
  --------------+----------+----------+---------+---------+---------+--------
  all           | 53 (355) | 79 (342) |  5 (33) | 10 (47) |    0         0

The effect of embedded modified path Bloom filters on these colliding
cases can be both beneficial and harmful:

  - We use a larger than necessary Bloom filter to hold 1-6 paths,
    which lowers the probability of these cases considerably.

    This is especially important for the homebrew repositories.  E.g.
    in homebrew-core over 98.6% of commits modify a single file in the
    'Formula' directory, i.e. two paths in total.  To store 2 paths
    using 7 hashes per path we would need a Bloom filter with a 20 bit
    array, which we would round up to 24 bits to use full bytes, i.e.
    those 98.6% of commits would have merely 24 bit Bloom filters.
    This makes those colliding cases all the more probable: 1 / 3456
    for case (1) and 1 / 493.7 for case (2) with 32 bit arithmetic).

  - We use Bloom filters of the same size to hold 1-6 paths, so if a
    path were to run into these cases, then more Bloom filter queries
    would return false positives than when using appropriately sized
    Bloom filters.

Now, there is a simple trick that can, to some extent, alleviate these
collision issues: check all leading directories of the pathspecs, i.e.
while looking for commits modifying 'dir/subdir/file', then query the
modified path Bloom filters not only for the full path, but for all
its leading directories 'dir/subdir' and 'dir' as well.  This way we
would only get a false positive if all bit positions of all leading
directories were set as well, which can significantly reduce the
probability of a pathspec running afoul of these colliding cases, and,
in general, can reduce the false positive rate by an order of
magnitude or three as well (see later in this patch series).

Checking all leading directories is not a silver bullet, though,
because it can only help if the pathspec does actually have leading
directories, and the deeper the pathspec in the directory tree, i.e.
the more leading directories it has, the lower the false positive
rate becomes.

  - So this doesn't help pathspecs in the root tree, because they
    obviously don't have any leading directories.

  - Furthermore, this doesn't help pathspecs that are immediately
    below such a frequently modified directory, because their only
    leading directory is modified in the majority of commits.

    This means that it can't help in the homebrew repositories,
    because 85% or 90% of their paths are directly under their
    frequently modified directories.

The only thing that can help even in these cases is hashing the paths
k times using k different seeds.

Phew.  Let's see some actual benchmarks, shall we?

The following tables in this section show the average false positive
rate of various hash functions and hashing schemes while listing the
histories of "a lot of paths", i.e. 'git rev-list HEAD -- $path'.  A
'*' denotes the lowest value in each row.  All cases use k = 7 hashes
per path, use the same basically random seeds of immense historical
significance, store 6 path in embedded modified path Bloom filters,
and check all leading directories of the given pathspec.

The table below compares the average false positive rate of 64 bit and
32 bit unsigned integer arithmetic using double hashing and enhanced
double hashing with the MurmurHash3 hash function:

                    Double hashing     | Enhanced double hashing
                   32 bit     64 bit   |    32 bit     64 bit
  -------------------------------------+------------------------
  android-base   0.008539%  0.015709%  | *0.004155%  0.004961%
  cmssw          0.006022%  0.013334%  | *0.003953%  0.004109%
  cpython        0.036840%  0.079816%  | *0.016607%  0.019330%
  elasticsearch  0.004069%  0.005297%  |  0.003249% *0.003101%
  gcc            0.016982%  0.034096%  | *0.008919%  0.011152%
  gecko-dev      0.001058%  0.002691%  | *0.000725%  0.000829%
  git            0.144256%  0.331346%  | *0.069921%  0.079405%
  glibc          0.026480%  0.053838%  | *0.016389%  0.017902%
  go             0.021930%  0.050348%  | *0.012616%  0.014178%
  homebrew-cask  0.097523%  0.508175%  | *0.009096%  0.042034%
  homebrew-core  0.120860%  0.556014%  | *0.005360%  0.026810%
  jdk            0.006085%  0.007526%  |  0.006431% *0.005911%
  linux          0.010908%  0.019081%  | *0.007494%  0.007896%
  llvm-project   0.006417%  0.009327%  | *0.003913%  0.004050%
  rails          0.024997%  0.046829%  |  0.013134% *0.012361%
  rust           0.038579%  0.056852%  | *0.025509%  0.027068%
  tensorflow     0.013732%  0.023307%  | *0.008243%  0.008848%
  webkit         0.002212%  0.002950%  | *0.001007%  0.001065%
  -------------------------------------+------------------------
  all            0.028395%  0.110075%  | *0.006085%  0.006675%
  w/o homebrew   0.010940%  0.021286%  | *0.005968%  0.011023%

The 64 bit unsigned arithmetic does indeed fare worse in almost every
case, and significantly worse in the two homebrew repositories.

So it seems that if using some form of double hashing, then 32 bit
unsigned integer arithmetic is the way to go, even though several
programming languages lack support for unsigned types (though thanks
to the distributivity of the modulo operation, they can simply and
cheaply implement it using 64 bit arithmetic and a '& (2^32-1)' mask).

The table below compares the average false positive rate of various
hashing schemes using 32 bit unsigned integer arithmetic and the
MurmurHash3 hash function:

                              Enhanced   Enhanced   Enhanced
                               double     double     double    Improved
                               hashing    hashing    hashing    double     Double
                   7 seeds   (original)  variant 1  variant 2   hashing    hashing
  ---------------------------------------------------------------------------------
  android-base    0.004214%  *0.004155%  0.004853%  0.005193%  0.008252%  0.008539%
  cmssw          *0.003344%   0.003953%  0.003546%  0.004732%  0.004226%  0.006022%
  cpython         0.016120%   0.016607% *0.015896%  0.020259%  0.020311%  0.036840%
  elasticsearch  *0.003167%   0.003249% *0.003164%  0.003531%  0.003553%  0.004069%
  gcc             0.010281%  *0.008919%  0.010359%  0.010642%  0.012166%  0.016982%
  gecko-dev       0.000804%   0.000725% *0.000646% *0.000648%  0.000822%  0.001058%
  git            *0.063025%   0.069921%  0.067643%  0.080744%  0.091423%  0.144256%
  glibc          *0.016278%   0.016389%  0.018660%  0.021094%  0.025596%  0.026480%
  go              0.013009%  *0.012616%  0.013301%  0.014316%  0.016345%  0.021930%
  homebrew-cask  *0.007199%   0.009096%  0.009070%  0.010722%  0.016617%  0.097523%
  homebrew-core  *0.003041%   0.005360%  0.005277%  0.005148%  0.016150%  0.120860%
  jdk             0.005873%   0.006431% *0.005326%  0.005955%  0.006984%  0.006085%
  linux           0.007764%  *0.007494%  0.007714%  0.008709%  0.009429%  0.010908%
  llvm-project   *0.003367%   0.003913%  0.003730%  0.004064%  0.004809%  0.006417%
  rails          *0.012708%   0.013134%  0.013929%  0.016316%  0.014358%  0.024997%
  rust           *0.024245%   0.025509%  0.025045%  0.028204%  0.032491%  0.038579%
  tensorflow     *0.007907%   0.008243%  0.008670%  0.009913%  0.010115%  0.013732%
  webkit         *0.000999%  *0.001007%  0.001142%  0.001113%  0.001274%  0.002212%
  ---------------------------------------------------------------------------------
  all            *0.005646%   0.006085%  0.006226%  0.006928%  0.009264%  0.028395%
  w/o homebrew   *0.005888%   0.005968%  0.006152%  0.006899%  0.007807%  0.010940%

Double hashing and improved double hashing have higher average false
positive rates than enhanced double hashing; note in particular the
significantly higher false positive rate with double hashing in the
two homebrew repositories.  Enhanced double hashing is only slightly
worse than 7 different seeds, at least in this particular case (i.e.
MurmurHash3 and these specific seeds).

Comparing these hashing schemes using a different hash function
(xxHash or FNV1a) shows a similar trend; for brevity I won't include
those tables here.

The table below compares the average false positive rate of different
32 bit hash functions when used in enhanced double hashing scheme with
k = 7 and 32 bit unsigned int arithmetic, or with 7 different seeds:

                     Enhanced double hashing     |             7 seeds             |  7 uint32
                  Murmur3     xxHash     FNV1a   |  Murmur3     xxHash     FNV1a   |   SHA256
  -----------------------------------------------+---------------------------------+-----------
  android-base   0.004155%  0.005556%  0.006113% | 0.004214% *0.004101%  0.005918% | 0.004168%
  cmssw          0.003953%  0.003775%  0.005087% | 0.003344%  0.003677%  0.004353% |*0.003322%
  cpython        0.016607%  0.015919%  0.021957% | 0.016120% *0.015238%  0.023649% | 0.017632%
  elasticsearch  0.003249%  0.003589%  0.004616% | 0.003167%  0.003360%  0.003586% |*0.003027%
  gcc           *0.008919%  0.009376%  0.010555% | 0.010281%  0.009157%  0.011882% | 0.010338%
  gecko-dev      0.000725%  0.000721%  0.000932% | 0.000804% *0.000611%  0.000911% | 0.000737%
  git            0.069921%  0.063449%  0.083097% |*0.063025%  0.069137%  0.087132% | 0.063763%
  glibc          0.016389%  0.017477%  0.024321% | 0.016278% *0.016062%  0.022641% | 0.017241%
  go             0.012616%  0.012449%  0.016728% | 0.013009% *0.011692%  0.017214% | 0.012489%
  homebrew-cask  0.009096%  0.025104%  0.011073% | 0.007199%  0.007343%  0.009037% |*0.007050%
  homebrew-core  0.005360%  0.007940%  0.005450% |*0.003041%  0.003832%  0.004888% | 0.003884%
  jdk            0.006431% *0.005451%  0.007428% | 0.005873%  0.005738%  0.006654% | 0.005749%
  linux         *0.007494%  0.008266%  0.009917% | 0.007764%  0.007723%  0.009256% | 0.007939%
  llvm-project   0.003913%  0.003727%  0.004584% | 0.003367% *0.003247%  0.005115% | 0.003664%
  rails          0.013134%  0.011277%  0.014103% | 0.012708% *0.010389%  0.015928% | 0.012506%
  rust           0.025509%  0.024596%  0.032983% |*0.024245%  0.025697%  0.031593% | 0.024794%
  tensorflow     0.008243%  0.008743%  0.011499% |*0.007907%  0.008089%  0.011922% | 0.008033%
  webkit         0.001007%  0.001155%  0.001401% | 0.000999% *0.000962%  0.001291% | 0.001054%
  -----------------------------------------------+---------------------------------+----------
  all            0.006085%  0.007327%  0.007518% | 0.005646% *0.005554%  0.007591% | 0.005857%
  w/o homebrew   0.005968%  0.005979%  0.007545% | 0.005888% *0.005660%  0.007855% | 0.006039%

MurmurHash3 and xxHash are neck and neck, be it enhanced double
hashing or 7 different seeds, at least when we ignore the unusually
high false positive rate of enhanced double hashing with xxHash in the
homebrew-cask repository (it stumbles upon one of those colliding
cases discussed above).

FNV1a has a decidedly higher average false positive rate than any of
the others.

I was curious to see whether using 7 unsigned integers from SHA256
offers any benefits (being a cryptographic hash function, it should
provide some high quality hash values), but apparently it doesn't fare
any better than MurmurHash3 and xxHash.  This leads me to believe that
both MurmurHash3 and xxHash are as good as it gets, and I would not
expect that any other hash function could achieve notably lower false
positive rates.

Now let's see how these hash functions and hashing schemes fare when
writing commit-graph files with '--reachable' and with modified path
Bloom filters from scratch at the end of this patch series.  Hash
functions tend to praise themselves about how fast they can process
huge chunks of data, but we'll use them to hash many tiny strings...

                         Total runtime of writing a commit-graph file
                         with modified path Bloom filters from scratch
                                    (Time spent hashing)

                     Enhanced double hashing     |             7 seeds             |  7 uint32
                  Murmur3     xxHash     FNV1a   |  Murmur3     xxHash     FNV1a   |   SHA256
  -----------------------------------------------+---------------------------------+----------
  android-base    40.880s    40.368s    40.569s  |  41.375s    40.557s    41.843s  |  42.706s
                  (0.627s)   (0.484s)   (0.777s) |  (1.467s)   (0.886s)   (2.138s) |  (3.682s)
  cmssw           25.691s    25.224s    25.645s  |  26.715s    25.998s    27.762s  |  29.527s
                  (0.894s)   (0.650s)   (1.179s) |  (2.083s)   (1.235s)   (3.259s) |  (5.077s)
  cpython          8.951s     8.929s     9.067s  |   9.072s     9.015s     9.008s  |   9.275s
                  (0.057s)   (0.042s)   (0.045s) |  (0.112s)   (0.067s)   (0.102s) |  (0.324s)
  elasticsearch   14.470s    14.320s    14.703s  |  14.983s    14.588s    15.948s  |  16.720s
                  (0.407s)   (0.300s)   (0.622s) |  (0.991s)   (0.594s)   (1.760s) |  (2.332s)
  gcc             36.917s    36.724s    37.251s  |  37.971s    37.449s    38.178s  |  38.332s
                  (0.313s)   (0.230s)   (0.346s) |  (0.694s)   (0.418s)   (0.919s) |  (1.796s)
  gecko-dev       97.729s    96.791s    97.233s  |  99.215s    97.158s   101.403s  | 105.332s
                  (1.730s)   (1.267s)   (2.099s) |  (3.902s)   (2.344s)   (5.773s) |  (9.553s)
  git              5.245s     5.401s     5.412s  |   5.518s     5.457s     5.474s  |   5.494s
                  (0.022s)   (0.017s)   (0.018s) |  (0.045s)   (0.027s)   (0.040s) |  (0.124s)
  glibc            4.146s     4.156s     4.187s  |   4.267s     4.201s     4.278s  |   4.495s
                  (0.060s)   (0.045s)   (0.057s) |  (0.128s)   (0.079s)   (0.144s) |  (0.331s)
  go               3.565s     3.563s     3.564s  |   3.631s     3.582s     3.607s  |   3.727s
                  (0.040s)   (0.030s)   (0.035s) |  (0.084s)   (0.051s)   (0.084s) |  (0.221s)
  homebrew-cask   29.818s    29.936s    30.279s  |  30.316s    30.435s    29.942s  |  29.939s
                  (0.025s)   (0.020s)   (0.019s) |  (0.049s)   (0.032s)   (0.038s) |  (0.153s)
  homebrew-core   55.478s    55.534s    56.248s  |  56.551s    56.100s    56.445s  |  55.641s
                  (0.031s)   (0.024s)   (0.023s) |  (0.062s)   (0.038s)   (0.047s) |  (0.195s)
  jdk             19.418s    19.151s    20.246s  |  21.270s    20.037s    24.073s  |  25.916s
                  (1.260s)   (0.878s)   (2.105s) |  (3.154s)   (1.833s)   (6.133s) |  (7.732s)
  linux          100.837s   100.130s   101.244s  | 103.775s   101.645s   103.856s  | 109.365s
                  (2.027s)   (1.498s)   (2.030s) |  (4.319s)   (2.682s)   (5.316s) | (11.075s)
  llvm-project    31.188s    31.392s    31.442s  |  31.895s    31.479s    31.984s  |  32.863s
                  (0.334s)   (0.251s)   (0.345s) |  (0.724s)   (0.437s)   (0.895s) |  (1.794s)
  rails            5.607s     5.639s     5.694s  |   5.742s     5.720s     5.865s  |   6.087s
                  (0.084s)   (0.063s)   (0.095s) |  (0.189s)   (0.116s)   (0.252s) |  (0.467s)
  rust            13.250s    13.206s    13.422s  |  13.701s    13.426s    13.680s  |  13.949s
                  (0.163s)   (0.122s)   (0.169s) |  (0.359s)   (0.217s)   (0.440s) |  (0.880s)
  tensorflow      11.808s    11.608s    11.915s  |  12.379s    11.966s    12.860s  |  13.588s
                  (0.368s)   (0.267s)   (0.497s) |  (0.860s)   (0.517s)   (1.369s) |  (2.081s)
  webkit          30.469s    30.735s    30.945s  |  32.005s    31.004s    32.401s  |  33.303s
                  (0.501s)   (0.376s)   (0.733s) |  (1.212s)   (0.725s)   (2.084s) |  (3.044s)

So xxHash is indeed the fastest, even in our use case, and MurmurHash3
comes in second.  The time spent hashing with 7 seeds tends to be
around twice as much as with enhanced double hashing when using the
same hash function.  However, the time spent hashing is only a
fraction of the total runtime, and while its effect on total runtime
is measurable both in best-of-five and average, it tends to be smaller
than run-to-run noise.

As for availability of hash functions:

  - MurmurHash3 is widely available; both the reference implementation
    and a streaming-capable ANSI C implementation is in the public
    domain.

  - xxHash is allegedly available in a variety of programming
    languages, as shown on www.xxhash.com (supposedly, but I haven't
    been able to load that page since months... some cloudflare host
    error persists)

  - SHA256 is widely available, and it must be part of every Git
    implementation in the near future anyway, but it's slower than the
    others, and, more importantly, it doesn't scale for k > 8.

  - FNV1a is so simple that anyone can implement a variant that
    incrementally computes two hashes up to the next directory
    separator in one go in about 20 lines of code (though note that
    the above benchmarks didn't use such an implementation).  Alas,
    because of its higher false positive rate it's out anyway.

Conclusion: we should seriously consider using MurmurHash3 (or xxHash)
and hashing each path k times with k different seeds instead of any
double hashing scheme.

Merges
------

It's not clear whether it's worth computing, storing and using
modified path Bloom filters for all parents of merge commits.

  - The number of paths modified between a merge commit and its
    second..nth parents is in general considerably larger than between
    any commit and its first parent.  E.g. a lot of files are modified
    in Git's master branch while a topic cooks for a few weeks, and
    much more in Linux when a branch started from the previous release
    is merged near the end of the next merge window.

                                 Average number of
                                   modified paths
                    Percentage      compared to:
                     of merge     first    second
                      commits     parent   parent
      --------------------------------------------
      android-base     73.6%       14.1    1553.6
      cmssw            11.0%       16.1     977.4
      cpython          11.7%        5.1     933.7
      elasticsearch     8.4%       40.1     246.5
      gecko-dev         3.5%       23.5     602.0
      git              25.3%        4.0     394.8
      homebrew-cask     9.6%        2.7      42.8
      jdk              25.0%      184.1     408.2
      linux             7.4%       26.1    2268.0
      rails            22.2%        9.6     101.0
      rust             27.0%       15.7     397.3
      tensorflow        9.1%       39.2    1057.4

    Consequently:

      - The tree-diff machinery has to work that much more to gather
        modified paths for all parents of merge commits, significantly
        increasing the runtime of writing the commit-graph file.

      - Storing modified path Bloom filters for all parents of merge
        commits significantly increases the size of the Modified Path
        Bloom Filters chunk, though depending on the percentage of
        merge commits and on the size of the other chunks the relative
        size increase of the whole commit-graph file might not be all
        that much.

      - [TODO: A few old test results suggest that pathspec-limited
        revision walks with default history simplification using a
        commit-graph file storing modified path Bloom filters for all
        merge parents are a few percents slower than when storing
        Bloom filters only for first parents.  Even fewer old test
        results suggest that writing all Bloom filters for first
        parents first, and then all for second..nth parents might
        eliminate much of that runtime difference.
        Definitely need more benchmarking.]

  - During a pathspec-limited revision walk Git's default history
    simplification only checks whether the given path was modified
    between a merge commit and its second..nth parents when the path
    was modified between that merge commit and its first parent.  This
    usually happens rarely, though these second..nth parent diffs tend
    to be more expensive than first parent diffs (because of the
    considerably more modified paths the tree-diff machinery can't
    short-circuit that early).  Anyway, the potential speedup is low.

  - However, with '--full-history', i.e. without any history
    simplification, all merge commits are compared to all their
    parents, and typical additional speedups are in the range of
    2x-3x, while in some cases over 7x or 11x can be achieved by using
    modified path Bloom filters for all parents of merge commits.

Does it worth it?  For me personally it doesn't, but I don't know how
often others use '--full-history' and what trade-offs they might be
willing to make.

So the file format described here adds _optional_ support for storing
modified path Bloom filters for all parents of merge commits, and the
users can make this decision themselves.

[TODO: describe it!]

Deduplication
-------------

Some commits have identical modified path Bloom filters, because they
modify the same set of paths (or because they modify different set of
paths but happen to end up setting the same bit positions in the Bloom
filter).  By omitting duplicates from the Modified Path Bloom Filters
chunk its size can be reduced typically by around 5-20%, and in case
of the android-base repository by over 69%.

Explicitly allow that multiple entries of the Modified Path Bloom
Filter Index chunk can refer to the same offset in the Modified Path
Bloom Filters chunk.  This is important, because even if an
implementation doesn't perform this deduplication while writing the
commit-graph file, it must be prepared for multiple index entries
referring to the same offset in commit-graph file written by a
different implementation.

Modified Path Bloom Filter Excludes
-----------------------------------

Some repositories contain leading directories that are modified in the
great majority of commits, e.g. the homebrew-core repository's
'Formula' directory is modified in over 99.7% of commits, while
homebrew-cask's 'Casks' and rust's 'src' are modified in over 93% of
commits.  And there is 'src/main/java/com/company/division/project' in
convention-following Java projects...

Modified path Bloom filters can't speed up revision walks when the
pathspec is such a frequently modified leading directory, because
because of potential false positives we'll have to run tree-diff for
the majority of commits anyway.  And it doesn't really make sense to
query the history of such a leading directory in practice, because it
will list the majority of commits, so one might as well look straight
at the output of a pathspec-less 'git log'.

However, adding those frequently modified leading directories to the
modified path Bloom filters requires more space and increases the
probability of false positives.

So the file format described here adds support for excluding specific
paths from modified path Bloom filters by listing them in the Modified
Path Bloom Filter Excludes chunk.

[TODO: Figure out the details!]

Limitations
-----------

Because of the possibility of false positives, if a modified path
Bloom filter query returns 'possibly in set', then we have to invoke
tree-diff to make sure that the path in question was indeed modified
by the given commit.  Consequently, Bloom filters can't improve
performance that much when looking for the history of a frequently
modified path, because a lot of tree-diff invocations can't be
eliminated.  In the extreme case when looking for the history of a
path modified in every commit, then using Bloom filters will only add
extra overhead.

A modified path Bloom filter doesn't store the names of modified
paths, it only sets a couple of bits based on those paths' hashes.
Consequently, it can only be used when looking for the history of a
concrete path, and:

  - It can't be used with a pathspec containing wildcards like 'git
    log -- "*.h"'.

    However, it could still be used when the pathspec contains leading
    directories without wildcards, e.g. 'git log --
    "arch/x86/include/*.h", by limiting tree-diff only to commits
    modifying the 'arch/x86/include/' directory.

  - It can't tell which paths were modified by a given commit; for
    that we would still have to run tree-diff for the full tree.

Submodules [TODO]
-----------------

No modified path Bloom filters should be stored for commits modifying
submodules.

This is questionable, but is necessary to produce the same output with
and without modified path Bloom filters.  If 'dir/submod' is a gitlink
file, then currently 'git rev-list HEAD -- dir/submod/whatever' lists
all commits touching 'dir/submod', even when that 'whatever' has never
existed.  And that 'whatever' can be basically anything, so we will
not find them in any of our modified path Bloom filters, therefore in
a Bloom-filter-assisted revision walk we won't list any commits.

The only way around this is to not not write any modified path Bloom
filters for commits modifying submodules.

Note, however, that once upon a time that command wouldn't list
anything, either, but the behavior changed with commit 74b4f7f277
(tree-walk.c: ignore trailing slash on submodule in
tree_entry_interesting(), 2014-01-23) to what we have now.  As
74b4f7f277's log message only talks about handling 'dir/submod/' and
'dir/submod' (i.e. with and without trailing slash) consistently, I
suspect that this behavior change with 'dir/submod/anything' is an
unintended and undesired side effect.  However, as it involves
submodules I would rather not have an opinion :)

In any case, someone with more clues about submodules should take a
closer look and decide whether this is a bug or not, before this
modified path Bloom filter thing goes much further.  If it is a bug
indeed, then it should be fixed and the remark about submodules should
be removed from the modified path Bloom filter specs.  If the current
behavior is desired, then the remark about submodules should be
updated to use proper English (IMO it must be part of the spec,
because this is a subtle issue that developers of other
implementations (JGit, libgit2, etc.) might easily overlook).

Threats to validity
-------------------

  - Random paths are... random.  Picking random paths can
    over-represent rarely modified files.  Since modified path Bloom
    filters bring more benefits to rarely modified paths, the reported
    speedups later in the series might be higher than what the users
    will usually see.  (I suppose that users more often check the logs
    of frequently modified files than of rarely modified ones.)

    Though some of these random paths made me stumble upon the issue
    with submodules discussed above, so...

  - Bugs :)  It's not that hard to make subtle bugs that don't affect
    correctness, because the probabilistic nature of Bloom filters
    cover them up.  However, bugs like incorrectly calculating the
    size of a Bloom filter or having an off-by-one error in the
    filter's array handling affect the false positive rate and in turn
    the runtime of pathspec-limited revision walks.

Alternatives considered
-----------------------

Here are some alternatives that I've considered but discarded and
ideas that I haven't (yet) followed through:

  - One Bloom filter to rule them all?  No.

    While the first proof of concept implementation [2] demonstrated
    that by combining hashes of modified pathn…
derrickstolee pushed a commit that referenced this pull request Jun 5, 2020
While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks.  While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.

Note that this necessitates the change of the error message as well,
and, consequently, have to update the 'detect incorrect chunk count'
test in 't5318-commit-graph.sh' as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee pushed a commit that referenced this pull request Jun 5, 2020
In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all those three blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculate the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee pushed a commit that referenced this pull request Jun 15, 2020
While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks.  While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.

Note that this necessitates the change of the error message as well,
and, consequently, have to update the 'detect incorrect chunk count'
test in 't5318-commit-graph.sh' as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Jun 15, 2020
In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all those three blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculate the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Aug 3, 2020
Includes these pull requests:

	#1
	#6
	#10
	#11
	git#157
	git#212
	git#260
	git#270

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee pushed a commit that referenced this pull request Aug 4, 2020
The changed-path Bloom filter is improved using ideas from an
independent implementation.

* sg/commit-graph-cleanups:
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: clean up #includes
  diff.h: drop diff_tree_oid() & friends' return value
  commit-slab: add a function to deep free entries on the slab
  commit-graph-format.txt: all multi-byte numbers are in network byte order
  commit-graph: fix parsing the Chunk Lookup table
  tree-walk.c: don't match submodule entries for 'submod/anything'
derrickstolee pushed a commit that referenced this pull request Oct 8, 2020
Our put_be32() routine and its variants (get_be32(), put_be64(), etc)
has two implementations: on some platforms we cast memory in place and
use nothl()/htonl(), which can cause unaligned memory access. And on
others, we pick out the individual bytes using bitshifts.

This introduces extra complexity, and sometimes causes compilers to
generate warnings about type-punning. And it's not clear there's any
performance advantage.

This split goes back to 660231a (block-sha1: support for
architectures with memory alignment restrictions, 2009-08-12). The
unaligned versions were part of the original block-sha1 code in
d7c208a (Add new optimized C 'block-sha1' routines, 2009-08-05),
which says it is:

   Based on the mozilla SHA1 routine, but doing the input data accesses a
   word at a time and with 'htonl()' instead of loading bytes and shifting.

Back then, Linus provided timings versus the mozilla code which showed a
27% improvement:

  https://lore.kernel.org/git/alpine.LFD.2.01.0908051545000.3390@localhost.localdomain/

However, the unaligned loads were either not the useful part of that
speedup, or perhaps compilers and processors have changed since then.
Here are times for computing the sha1 of 4GB of random data, with and
without -DNO_UNALIGNED_LOADS (and BLK_SHA1=1, of course). This is with
gcc 10, -O2, and the processor is a Core i9-9880H.

  [stock]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.638 s ±  0.081 s    [User: 6.269 s, System: 0.368 s]
    Range (min … max):    6.550 s …  6.841 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):      6.418 s ±  0.015 s    [User: 6.058 s, System: 0.360 s]
    Range (min … max):    6.394 s …  6.447 s    10 runs

And here's the same test run on an AMD A8-7600, using gcc 8.

  [stock]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):     11.721 s ±  0.113 s    [User: 10.761 s, System: 0.951 s]
    Range (min … max):   11.509 s … 11.861 s    10 runs

  [-DNO_UNALIGNED_LOADS]
  Benchmark #1: t/helper/test-tool sha1 <foo.rand
    Time (mean ± σ):     11.744 s ±  0.066 s    [User: 10.807 s, System: 0.928 s]
    Range (min … max):   11.637 s … 11.863 s    10 runs

So the unaligned loads don't seem to help much, and actually make things
worse. It's possible there are platforms where they provide more
benefit, but:

  - the non-x86 platforms for which we use this code are old and obscure
    (powerpc and s390).

  - the main caller that cares about performance is block-sha1. But
    these days it is rarely used anyway, in favor of sha1dc (which is
    already much slower, and nobody seems to have cared that much).

Let's just drop unaligned versions entirely in the name of simplicity.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Nov 3, 2020
Includes these pull requests:

	#1
	#6
	#10
	#11
	git#157
	git#212
	git#260
	git#270

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee pushed a commit that referenced this pull request Dec 23, 2020
Test 5572.63 ("branch has no merge base with remote-tracking
counterpart") was introduced in 4d36f88 (submodule: do not pass null
OID to setup_revisions, 2018-05-24), as a regression test for the bug
this commit was fixing (preventing a 'fatal: bad object' error when the
current branch and the remote-tracking branch we are pulling have no
merge-base).

However, the commit message for 4d36f88 does not describe in which
real-life situation this bug was encountered. The brief discussion on the
mailing list [1] does not either.

The regression test is not really representative of a real-life
scenario: both the local repository and its upstream have only a single
commit, and the "no merge-base" scenario is simulated by recreating this
root commit in the local repository using 'git commit-tree' before
calling 'git pull --rebase --recurse-submodules'. The rebase succeeds
and results in the local branch being reset to the same root commit as
the upstream branch.

The fix in 4d36f88 modifies 'submodule.c::submodule_touches_in_range'
so that if 'excl_oid' is null, which is the case when the 'git merge-base
--fork-point' invocation in 'builtin/pull.c::get_rebase_fork_point'
errors (no fork-point), then instead of 'incl_oid --not excl_oid' being
passed to setup_revisions, only 'incl_oid' is passed, and
'submodule_touches_in_range' examines 'incl_oid' and all its ancestors
to verify that they do not touch the submodule.

In test 5572.63, the recreated lone root commit in the local repository is
thus the only commit being examined by 'submodule_touches_in_range', and
this commit *adds* the submodule. However, 'submodule_touches_in_range'
*succeeds* because 'combine-diff.c::diff_tree_combined' (see the
backtrace below) returns early since this commit is the root commit
and has no parents.

  #0  diff_tree_combined at combine-diff.c:1494
  #1  0x0000000100150cbe in diff_tree_combined_merge at combine-diff.c:1649
  #2  0x00000001002c7147 in collect_changed_submodules at submodule.c:869
  #3  0x00000001002c7d6f in submodule_touches_in_range at submodule.c:1268
  #4  0x00000001000ad58b in cmd_pull at builtin/pull.c:1040

In light of all this, add a note in t5572 documenting this peculiar
test.

[1] https://lore.kernel.org/git/20180524204729.19896-1-jonathantanmy@google.com/t/#u

Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
derrickstolee pushed a commit that referenced this pull request Dec 23, 2020
Includes these pull requests:

	#1
	#6
	#10
	#11
	git#157
	git#212
	git#260
	git#270

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
dscho added a commit that referenced this pull request Dec 24, 2020
Includes these pull requests:

	#1
	#6
	#10
	#11
	git#157
	git#212
	git#260
	git#270

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee pushed a commit that referenced this pull request Dec 29, 2020
Includes these pull requests:

	#1
	#6
	#10
	#11
	git#157
	git#212
	git#260
	git#270

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee added a commit that referenced this pull request Jan 4, 2021
The previous change reduced time spent in strlen() while comparing
consecutive paths in verify_cache(), but we can do better. The
conditional checks the existence of a directory separator at the correct
location, but only after doing a string comparison. Swap the order to be
logically equivalent but perform fewer string comparisons.

To test the effect on performance, I used a repository with over three
million paths in the index. I then ran the following command on repeat:

  git -c index.threads=1 commit --amend --allow-empty --no-edit

Here are the measurements over 10 runs after a 5-run warmup:

  Benchmark #1: v2.30.0
    Time (mean ± σ):     854.5 ms ±  18.2 ms
    Range (min … max):   825.0 ms … 892.8 ms

  Benchmark #2: Previous change
    Time (mean ± σ):     833.2 ms ±  10.3 ms
    Range (min … max):   815.8 ms … 849.7 ms

  Benchmark #3: This change
    Time (mean ± σ):     815.5 ms ±  18.1 ms
    Range (min … max):   795.4 ms … 849.5 ms

This change is 2% faster than the previous change and 5% faster than
v2.30.0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee added a commit that referenced this pull request Jan 4, 2021
The previous change reduced time spent in strlen() while comparing
consecutive paths in verify_cache(), but we can do better. The
conditional checks the existence of a directory separator at the correct
location, but only after doing a string comparison. Swap the order to be
logically equivalent but perform fewer string comparisons.

To test the effect on performance, I used a repository with over three
million paths in the index. I then ran the following command on repeat:

  git -c index.threads=1 commit --amend --allow-empty --no-edit

Here are the measurements over 10 runs after a 5-run warmup:

  Benchmark #1: v2.30.0
    Time (mean ± σ):     854.5 ms ±  18.2 ms
    Range (min … max):   825.0 ms … 892.8 ms

  Benchmark #2: Previous change
    Time (mean ± σ):     833.2 ms ±  10.3 ms
    Range (min … max):   815.8 ms … 849.7 ms

  Benchmark #3: This change
    Time (mean ± σ):     815.5 ms ±  18.1 ms
    Range (min … max):   795.4 ms … 849.5 ms

This change is 2% faster than the previous change and 5% faster than
v2.30.0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee added a commit that referenced this pull request Jan 4, 2021
This is enough to make 'git -c core.fsmonitor="" status -uno' operate
entirely on a sparse index without expanding to a full index.

Warm cache case:
Benchmark #1: full index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):      2.564 s ±  0.058 s    [User: 1.489 s, System: 1.006 s]
  Range (min … max):    2.494 s …  2.685 s    10 runs

Benchmark #2: sparse index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):     127.0 ms ±   4.1 ms    [User: 118.2 ms, System: 183.5 ms]
  Range (min … max):   119.3 ms … 134.2 ms    22 runs

Summary
  'sparse index (git -c core.fsmonitor= status -uno)' ran
   20.19 ± 0.80 times faster than 'full index (git -c core.fsmonitor= status -uno)'

Cold cache case:
Benchmark #1: full index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):      3.911 s ±  0.080 s    [User: 1.550 s, System: 1.503 s]
  Range (min … max):    3.777 s …  4.020 s    10 runs

Benchmark #2: sparse index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):      2.973 s ±  0.097 s    [User: 291.3 ms, System: 1082.6 ms]
  Range (min … max):    2.811 s …  3.078 s    10 runs

Summary
  'sparse index (git -c core.fsmonitor= status -uno)' ran
    1.32 ± 0.05 times faster than 'full index (git -c core.fsmonitor= status -uno)'

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee added a commit that referenced this pull request Jan 7, 2021
The previous change reduced time spent in strlen() while comparing
consecutive paths in verify_cache(), but we can do better. The
conditional checks the existence of a directory separator at the correct
location, but only after doing a string comparison. Swap the order to be
logically equivalent but perform fewer string comparisons.

To test the effect on performance, I used a repository with over three
million paths in the index. I then ran the following command on repeat:

  git -c index.threads=1 commit --amend --allow-empty --no-edit

Here are the measurements over 10 runs after a 5-run warmup:

  Benchmark #1: v2.30.0
    Time (mean ± σ):     854.5 ms ±  18.2 ms
    Range (min … max):   825.0 ms … 892.8 ms

  Benchmark #2: Previous change
    Time (mean ± σ):     833.2 ms ±  10.3 ms
    Range (min … max):   815.8 ms … 849.7 ms

  Benchmark #3: This change
    Time (mean ± σ):     815.5 ms ±  18.1 ms
    Range (min … max):   795.4 ms … 849.5 ms

This change is 2% faster than the previous change and 5% faster than
v2.30.0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
derrickstolee added a commit that referenced this pull request Jan 19, 2021
This is enough to make 'git -c core.fsmonitor="" status -uno' operate
entirely on a sparse index without expanding to a full index.

Warm cache case:
Benchmark #1: full index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):      2.564 s ±  0.058 s    [User: 1.489 s, System: 1.006 s]
  Range (min … max):    2.494 s …  2.685 s    10 runs

Benchmark #2: sparse index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):     127.0 ms ±   4.1 ms    [User: 118.2 ms, System: 183.5 ms]
  Range (min … max):   119.3 ms … 134.2 ms    22 runs

Summary
  'sparse index (git -c core.fsmonitor= status -uno)' ran
   20.19 ± 0.80 times faster than 'full index (git -c core.fsmonitor= status -uno)'

Cold cache case:
Benchmark #1: full index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):      3.911 s ±  0.080 s    [User: 1.550 s, System: 1.503 s]
  Range (min … max):    3.777 s …  4.020 s    10 runs

Benchmark #2: sparse index (git -c core.fsmonitor= status -uno)
  Time (mean ± σ):      2.973 s ±  0.097 s    [User: 291.3 ms, System: 1082.6 ms]
  Range (min … max):    2.811 s …  3.078 s    10 runs

Summary
  'sparse index (git -c core.fsmonitor= status -uno)' ran
    1.32 ± 0.05 times faster than 'full index (git -c core.fsmonitor= status -uno)'

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant