Git packfile index bloom filter RFC

Problem

Especially for server-side usage, repacking is extremely expensive, and even creating multi-pack-indexes (MIDXs) is still rather costly. Incremental MIDX partially solves this, but accumulating too many layers would defeat the purpose of the MIDX, as Git would still have to walk the MIDX layers in order while performing expensive lookup queries.

Idea

Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and ordinary .idx files are still traversed in their usual order, but the first step when traversing each of them is to check, via its bloom filter, whether that index could possibly contain the desired object.
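The intended traversal can be sketched as follows. This is an illustrative sketch only: `Layer`, `filter_maybe_contains`, and `lookup` are hypothetical names, not Git internals, and the set-backed filter stands in for the real bloom filter.

```python
# Sketch of the proposed lookup flow across index layers. All names here are
# hypothetical; a Python set stands in for the bloom filter, and a dict for
# the real (comparatively expensive) index lookup.

class Layer:
    def __init__(self, objects, filt):
        self._objects = objects      # stand-in for a MIDX layer / .idx file
        self._filter = filt          # stand-in bloom filter over its OIDs

    def filter_maybe_contains(self, oid):
        return oid in self._filter   # cheap: one cache-line read in practice

    def lookup(self, oid):
        return self._objects.get(oid)  # expensive: the real index query

def find_object(layers, oid):
    for layer in layers:                       # usual traversal order
        if not layer.filter_maybe_contains(oid):
            continue                           # definitely absent: skip layer
        hit = layer.lookup(oid)                # may still miss (false positive)
        if hit is not None:
            return hit
    return None
```

The filter gate only ever skips layers that definitely lack the object; a false positive merely costs one ordinary lookup, so behavior is unchanged.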

We will want the filters to be mmapped, and we want the lookup cost to be dominated by a single cache-line read rather than many scattered reads. Therefore, a blocked bloom filter is likely the right direction here. The steps are as follows:

  1. Split the filter into 64-octet buckets, since 64 octets is the most common cache-line size.
  2. Use some bits of the object ID to choose the bucket.
  3. Use the rest of the key to choose several bit positions inside that bucket.
  4. A lookup thus reads one 64-octet bucket and checks whether all required bits are set.
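The steps above can be sketched as a small in-memory filter. This is a demonstration only, with tiny parameters and illustrative names; real Git object IDs are already hashes, so the extra SHA-256 here exists only so arbitrary demo keys behave like uniformly distributed OIDs.

```python
import hashlib

BUCKET_OCTETS = 64   # one cache line
LOG2_B = 6           # B = 64 buckets; tiny, for demonstration only
K = 8                # bit positions tested per key

class BlockedBloom:
    def __init__(self):
        self.buckets = bytearray(BUCKET_OCTETS << LOG2_B)

    def _positions(self, key: bytes):
        # Git OIDs are already uniform hashes; hashing here is only so that
        # arbitrary demo keys are uniformly distributed too.
        h = int.from_bytes(hashlib.sha256(key).digest(), "big")
        bucket = h >> (256 - LOG2_B)             # top bits pick the bucket
        fields = []
        for i in range(K):
            shift = 256 - LOG2_B - 9 * (i + 1)
            fields.append((h >> shift) & 0x1FF)  # 9 bits per bit position
        return bucket, fields

    def add(self, key: bytes):
        bucket, fields = self._positions(key)
        for p in fields:
            # Bit index 0 is the most significant bit of the bucket.
            self.buckets[bucket * BUCKET_OCTETS + (p >> 3)] |= 0x80 >> (p & 7)

    def maybe_contains(self, key: bytes) -> bool:
        bucket, fields = self._positions(key)
        return all(
            self.buckets[bucket * BUCKET_OCTETS + (p >> 3)] & (0x80 >> (p & 7))
            for p in fields
        )
```

Every probe for one key touches a single 64-octet bucket, which is the point of the blocked layout: one cache-line read per lookup.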

Note on Object IDs

Git object IDs are cryptographic hashes (e.g., currently either SHA-256 or SHA-1), and are thus uniformly distributed in non-pathological scenarios. See also the “Security considerations” section.

Definitions

Let:

  1. B be the number of buckets in the filter.
  2. K be the number of bit positions tested per object ID.

Note that:

  1. B is a power of two.
  2. Each bucket is 64 octets, i.e. 8 words of 64 bits each, for 512 bits in total, so a 9-bit field suffices to address any single bit within a bucket.

File layout

Validation

Lookup procedure

  1. Let b be the unsigned integer encoded by the most significant log2(B) bits of the OID. B is a power of two, and 0 ≤ b < B.
  2. Select and read bucket b.
  3. For each 0 ≤ i < K:
    1. Starting immediately after the most significant log2(B) bits of the OID, let the i-th 9-bit field be the bits at offset 9 × i through 9 × i + 8 within the next 9 × K bits of the OID.
    2. Let p_i be the unsigned integer encoded by that 9-bit field. Then, 0 ≤ p_i < 512.
    3. Compute w_i = p_i >> 6 and b_i = p_i & 63. Thus, w_i identifies one of the 8 64-bit words in bucket b, and b_i identifies one bit within that word.
    4. Test whether bit b_i is set in word w_i of bucket b. (Within each 64-bit word, bit index 0 denotes the most significant bit, and bit index 63 the least significant bit.)

If any test fails, the OID is definitely not in the relevant idx. If all tests succeed, the OID may be in the relevant idx.

Note that two of the K 9-bit fields can decode to the same p_i, which means an insertion may set fewer than K distinct bits.
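The procedure above can be transcribed directly. In this sketch the filter is modeled as a flat list of 64-bit words, 8 per bucket; the parameter names (`log2_b`, `k`) are this sketch's own, not part of any proposed on-disk format.

```python
# Direct transcription of the lookup procedure, over a flat list of 64-bit
# words (8 words per bucket). Illustrative only; not a proposed file format.

def oid_probes(oid: bytes, log2_b: int, k: int):
    bits = int.from_bytes(oid, "big")
    total = len(oid) * 8
    b = bits >> (total - log2_b)           # step 1: bucket index
    probes = []
    for i in range(k):                     # step 3
        shift = total - log2_b - 9 * (i + 1)
        p = (bits >> shift) & 0x1FF        # i-th 9-bit field, 0 <= p < 512
        probes.append((p >> 6, p & 63))    # (word w_i, bit b_i)
    return b, probes

def maybe_present(words, oid: bytes, log2_b: int, k: int) -> bool:
    b, probes = oid_probes(oid, log2_b, k)
    # Bit index 0 is the most significant bit of each 64-bit word.
    return all(words[8 * b + w] & (1 << (63 - bi)) for w, bi in probes)
```

Insertion would set the same bits that `maybe_present` tests, which is also where the fewer-than-K-distinct-bits case arises: duplicate p_i values collapse into one set bit.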

Worked example

Let:

  1. B = 32768.
  2. K = 8.

Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket and 8 × 9 = 72 bits to choose the in-bucket positions, for a total of 87 bits taken from the object ID.

  1. Read the first 15 bits of the OID and interpret them as b, where 0 ≤ b < 32768.
  2. Read bucket b.
  3. For each 0 ≤ i < 8:
    1. Read the i-th 9-bit field from the next 72 bits of the OID and interpret it as p_i, where 0 ≤ p_i < 512.
    2. Compute w_i = p_i >> 6 and b_i = p_i & 63.
    3. Test whether bit b_i is set in word w_i of bucket b.
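Running these parameters over a concrete 160-bit value makes the bit accounting tangible. The sample OID below is the well-known SHA-1 of "The quick brown fox jumps over the lazy dog"; any OID would do.

```python
# Trace of the worked example (log2(B) = 15, K = 8) on a sample 160-bit OID.

oid = bytes.fromhex("2fd4e1c67a2d28fced849ee1bb76e7391b93eb12")
bits = int.from_bytes(oid, "big")

b = bits >> (160 - 15)                                     # first 15 bits
ps = [(bits >> (160 - 15 - 9 * (i + 1))) & 0x1FF for i in range(8)]

print(b)    # 6122
print(ps)   # [225, 396, 488, 361, 143, 413, 353, 79]
print([(p >> 6, p & 63) for p in ps])                      # (w_i, b_i) pairs
```

Only the first 87 bits of the 160-bit OID are consumed; the remaining 73 bits are unused by this lookup.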

Security considerations

An adversarial packfile whose objects are constructed to share a prefix under the relevant object-format hash algorithm (a computationally intensive task, even for a hash as weakened as SHA-1) could be used to fill up the bloom filters, rendering some buckets useless. In the worst case, if an adversary somehow saturated all filters, this proposal's optimizations would become useless, but this would not be a significant DoS vector.

TODOs