Git packfile index Bloom filter RFC

Problem

Repacking is extremely expensive, especially in server-side use, and creating a multi-pack-index (MIDX) is still rather expensive. Incremental MIDX partially solves this, but too many layers defeat the purpose of MIDX: Git still has to walk the layers in order, performing an expensive index query against each one.

Idea

Each MIDX layer, and each non-MIDX index, comes with a Bloom filter. MIDXes and ordinary .idx files are still traversed in their usual order, but the first step when visiting each one is to check its Bloom filter to see whether that index could possibly contain the desired object.
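
The traversal described above can be sketched as follows. This is only an illustration under assumptions: `IndexLayer`, `may_contain`, `lookup`, and `find_object` are hypothetical names, not Git APIs, and the filter is modeled as an arbitrary callable that may return false positives but never false negatives.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class IndexLayer:
    """Hypothetical stand-in for one MIDX layer or one ordinary .idx file."""
    name: str
    objects: dict                          # OID (bytes) -> pack offset (int)
    may_contain: Callable[[bytes], bool]   # the filter; false positives allowed
    probes: int = 0                        # counts expensive index queries

    def lookup(self, oid: bytes) -> Optional[int]:
        self.probes += 1                   # stands in for the costly index search
        return self.objects.get(oid)


def find_object(layers, oid: bytes) -> Optional[Tuple[str, int]]:
    """Walk the layers in their usual order, consulting each filter first."""
    for layer in layers:
        if not layer.may_contain(oid):
            continue                       # definitely absent: skip the index query
        offset = layer.lookup(oid)
        if offset is not None:
            return layer.name, offset
    return None
```

The order of traversal is unchanged; the filter only decides whether the expensive query on a given layer is worth making.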

We want the filters to be mmapped, and we want the lookup cost to be dominated by a single cache-line read rather than many scattered reads. A blocked Bloom filter is therefore likely the right direction. The steps are as follows:

Split the filter into $64$-octet buckets, since $64$ octets is the most common cache-line size.
Use some bits of the object ID to choose the bucket.
Use further bits of the object ID to choose several bit positions inside that bucket.
A lookup thus reads one $64$-octet bucket and checks whether all required bits are set.

Note on Object IDs

Git object IDs are cryptographic hashes (currently either SHA-1 or SHA-256), and are thus uniformly distributed in non-pathological scenarios. See also the “Security considerations” section.

Definitions

Let:

$B = \text{number of buckets}$
$K = \text{number of bits set and tested per object ID}$

Note that:

All integers here are big-endian.
The OID is to be interpreted as a big-endian bitstring, where bit offset $0$ is the most significant bit of octet $0$.
$\log_2(B) + 9K \le \text{hash length in bits}$.
In this document $\ll$ and $\gg$ refer to bit shifts, not “much less than”, etc.
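
As a minimal sketch of the bit-numbering convention above, the helper below extracts a run of bits from an OID, and a second helper computes the left-hand side of the constraint. The names `oid_bits` and `bits_needed` are illustrative assumptions, and `bits_needed` assumes $B$ is a power of two.

```python
def oid_bits(oid: bytes, offset: int, length: int) -> int:
    """Return `length` bits of `oid` starting at bit `offset`, as an unsigned int.

    Bit offset 0 is the most significant bit of octet 0.
    """
    total = len(oid) * 8
    if offset < 0 or length < 0 or offset + length > total:
        raise ValueError("bit range lies outside the OID")
    value = int.from_bytes(oid, "big")
    # Shift the wanted run down to bit 0, then mask off everything above it.
    return (value >> (total - offset - length)) & ((1 << length) - 1)


def bits_needed(B: int, K: int) -> int:
    """log2(B) + 9K, assuming B is a power of two."""
    return (B.bit_length() - 1) + 9 * K
```

For example, `bits_needed(1 << 15, 8)` is 87, which fits in both a 160-bit SHA-1 OID and a 256-bit SHA-256 OID.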

File layout

$4$-octet signature: {'I', 'D', 'B', 'L'}
$4$-octet version identifier ($= 1$)
$4$-octet object hash algorithm identifier ($= 1$ for SHA-1, $2$ for SHA-256)
$4$-octet $B$ (number of buckets)
$2$-octet $K$ (number of bits set and tested per object ID)
$6$-octet padding (must be all zeros)
$B$ buckets of $64$ octets each.
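
The header layout above can be expressed with Python's `struct` module, as a sketch. The names `HEADER`, `pack_header`, and `unpack_header` are assumptions made for illustration; the `>` prefix selects big-endian encoding with no implicit alignment padding, so the six padding octets are spelled out explicitly.

```python
import struct

# signature, version, hash algorithm id, B, K, padding
HEADER = struct.Struct(">4sIIIH6s")
SIGNATURE = b"IDBL"


def pack_header(hash_id: int, B: int, K: int) -> bytes:
    """Serialize a version-1 header; the padding is always six zero octets."""
    return HEADER.pack(SIGNATURE, 1, hash_id, B, K, bytes(6))


def unpack_header(data: bytes) -> dict:
    """Decode the fixed-size header without validating its fields."""
    signature, version, hash_id, B, K, padding = HEADER.unpack_from(data)
    return {
        "signature": signature,
        "version": version,
        "hash_id": hash_id,
        "B": B,
        "K": K,
        "padding": padding,
    }
```

`HEADER.size` is 24, so the buckets start at octet offset 24 of the file.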

Validation

Matching signature
Supported version (the rest of the rules are for this version)
Hash function identifier must be recognized
$B$ must be nonzero and a power of two
$K$ must be nonzero
$\log_2(B) + 9K \le \text{hash length in bits}$
Padding must be all zero
File size must be $24 + 64 \times B$ octets
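
A minimal sketch of these checks, applied in the order listed, might look like the following. The function name `validate`, the `HASH_BITS` table, and the choice of `ValueError` are assumptions for illustration only.

```python
import struct

# Hash algorithm identifier -> OID length in bits (1 = SHA-1, 2 = SHA-256).
HASH_BITS = {1: 160, 2: 256}


def validate(data: bytes):
    """Return (B, K) if `data` is a well-formed version-1 filter, else raise."""
    if len(data) < 24:
        raise ValueError("truncated header")
    sig, version, hash_id, B, K, pad = struct.unpack_from(">4sIIIH6s", data)
    if sig != b"IDBL":
        raise ValueError("bad signature")
    if version != 1:
        raise ValueError("unsupported version")
    if hash_id not in HASH_BITS:
        raise ValueError("unrecognized hash algorithm identifier")
    if B == 0 or B & (B - 1):
        raise ValueError("B must be a nonzero power of two")
    if K == 0:
        raise ValueError("K must be nonzero")
    # For a power of two, B.bit_length() - 1 equals log2(B).
    if (B.bit_length() - 1) + 9 * K > HASH_BITS[hash_id]:
        raise ValueError("log2(B) + 9K exceeds the hash length")
    if pad != bytes(6):
        raise ValueError("padding must be all zero")
    if len(data) != 24 + 64 * B:
        raise ValueError("unexpected file size")
    return B, K
```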

Lookup procedure

Let $b$ be the unsigned integer encoded by the most significant $\log_2(B)$ bits of the OID. Since $B$ is a power of two, $0 \le b < B$.
Select and read bucket $b$.
For each $0 \le i < K$:
1. Starting immediately after the most significant $\log_2(B)$ bits of the OID, take the next $9 \times K$ bits. The $i$-th $9$-bit field consists of the bits at offsets $9 \times i$ through $9 \times i + 8$ within that run.
2. Let $p_i$ be the unsigned integer encoded by that $9$-bit field. Then, $0 \le p_i < 512$.
3. Compute $w_i = p_i \gg 6$, and $b_i = p_i \:\&\: 63$. Thus, $w_i$ identifies one of the eight $64$-bit words in bucket $b$, and $b_i$ identifies one bit within that word.
4. Test whether bit $b_i$ is set in word $w_i$ of bucket $b$. (Within each $64$-bit word, bit index $0$ denotes the most significant bit, and bit index $63$ denotes the least significant bit.)

If any test fails, the OID is definitely not in the relevant idx. If all tests succeed, the OID may be in the relevant idx.

Note that two or more of the $K$ $9$-bit fields can decode to the same $p_i$, which means an insertion may set fewer than $K$ distinct bits.
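
The lookup procedure can be sketched as below, operating on the bucket area only (the $64 \times B$ octets after the header). All function names are assumptions. The `add` function is included purely so the lookup can be exercised; it is the obvious bit-setting counterpart, not a proposed creation procedure, which this document leaves open.

```python
def positions(oid: bytes, B: int, K: int):
    """Return (bucket index b, [p_0, ..., p_{K-1}]) for an OID."""
    total = len(oid) * 8
    log2_b = B.bit_length() - 1              # B is a power of two
    value = int.from_bytes(oid, "big")
    b = value >> (total - log2_b)            # most significant log2(B) bits
    ps = [(value >> (total - log2_b - 9 * (i + 1))) & 0x1FF for i in range(K)]
    return b, ps


def word_and_mask(p: int):
    """Split p into the word index w and a mask for bit p & 63 of that word."""
    w, bit = p >> 6, p & 63
    return w, 1 << (63 - bit)                # bit index 0 is the MSB


def may_contain(buckets: bytes, oid: bytes, B: int, K: int) -> bool:
    b, ps = positions(oid, B, K)
    bucket = buckets[64 * b:64 * (b + 1)]    # the single 64-octet read
    for p in ps:
        w, mask = word_and_mask(p)
        word = int.from_bytes(bucket[8 * w:8 * w + 8], "big")
        if not word & mask:
            return False                     # definitely not present
    return True                              # possibly present


def add(buckets: bytearray, oid: bytes, B: int, K: int) -> None:
    """Set the K bits for `oid` (demonstration only)."""
    b, ps = positions(oid, B, K)
    for p in ps:
        w, mask = word_and_mask(p)
        start = 64 * b + 8 * w
        word = int.from_bytes(buckets[start:start + 8], "big") | mask
        buckets[start:start + 8] = word.to_bytes(8, "big")
```

With $B = 4$ and $K = 2$, an all-zero OID decodes to $p_0 = p_1 = 0$, so adding it sets only one bit, which illustrates the note about duplicate fields.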

Worked example

Let:

$B = 1 \ll 15 = 32768$
$K = 8$

Then, $\log_2(B) = 15$. Each lookup thus uses $15$ bits to choose the bucket and $8 \times 9 = 72$ bits to choose the in-bucket positions, for a total of $87$ bits taken from the object ID.

Read the first $15$ bits of the OID and interpret them as $b$, where $0 \le b < 32768$.
Read bucket $b$.
For each $0 \le i < 8$:
1. Read the $i$-th $9$-bit field from the next $72$ bits of the OID and interpret it as $p_i$, where $0 \le p_i < 512$.
2. Compute: $w_i = p_i \gg 6$, $b_i = p_i \:\&\: 63$.
3. Test whether bit $b_i$ is set in word $w_i$ of bucket $b$.
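
To make the worked example concrete, the sketch below runs the same steps on one sample OID. The OID is an assumption chosen for illustration (the SHA-1 of Git's empty blob); any 20-octet value would do.

```python
B, K = 1 << 15, 8
oid = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")

total = len(oid) * 8                  # 160 bits for SHA-1
value = int.from_bytes(oid, "big")
log2_b = B.bit_length() - 1           # 15

# The first 15 bits select the bucket.
b = value >> (total - log2_b)

# The next 72 bits are split into eight 9-bit fields p_0 .. p_7.
fields = [(value >> (total - log2_b - 9 * (i + 1))) & 0x1FF for i in range(K)]

# Each field yields (w_i, b_i): a word index and a bit index in that word.
probes = [(p >> 6, p & 63) for p in fields]

print(b)       # 29518
print(fields)  # [482, 311, 203, 142, 356, 105, 226, 404]
print(probes)  # [(7, 34), (4, 55), (3, 11), (2, 14), (5, 36), (1, 41), (3, 34), (6, 20)]
```

For this OID, a lookup reads bucket 29518 and tests eight distinct bits spread over seven of its eight words.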

Security considerations

An adversary could craft a packfile whose objects share the same OID prefix under the relevant hash algorithm, saturating some buckets of the Bloom filter and rendering them useless. Constructing such objects is computationally intensive, even for SHA-1, as vulnerable as it is. In the worst case, if the adversary somehow saturates every filter, the optimization in this proposal provides no benefit, but this would not be a significant DoS vector.

TODOs

How should B and K be chosen?
How does creation/insert work? Note that packfiles and .idxes are immutable.
What are the sizes?
What are the false positive rates?
Is there a way to make this SIMD friendly?
What do benchmarks show?