Problem
Especially for server-side usages, repacking is extremely expensive, and creating multi-pack-indexes is still rather expensive. Incremental MIDX partially solves this, but would defeat the purpose of MIDX when there are too many of them, as Git would still have to walk the MIDXes in order while performing expensive indexing queries.
Idea
Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and
ordinary .idx files are still traversed in their usual order, but the first
step when traversing them, is to check whether that index could possibly have
the desired object, through a bloom filter.
We will want the filters to be mmaped, and we want the lookup cost to be dominated by one cache-line read rather than using many scattered reads. Therefore, a blocked bloom filter is likely the right direction here. The steps are as follows:
- Split the filter into -octet buckets, since octets is the most common cache-line size.
- Use some bits of the object ID to choose the bucket.
- Use the rest of the key to choose several bit positions inside that bucket.
- A lookup thus reads one -octet bucket and checks whether all required bits are set.
Note on Object IDs
Git object IDs are cryptographic hashes (e.g., currently either SHA-256 or SHA-1), and are thus uniformly distributed in non-pathological scenarios. See also the “Security considerations” section.
Definitions
Let:
Note that:
- All integers here are big endian.
- The OID is to be interpreted as a big-endian bitstring, where bit offset is the most significant bit of octet .
- .
- In this document and refer to bit shifts, not “much less than”, etc.
File layout
- -octet signature:
{'I', 'D', 'B', 'L'} - -octet version identifier ()
- -octet object hash algorithm identifier ( for SHA-1, for SHA-256)
- -octet (number of buckets)
- -octet (number of bits set and tested per object ID)
- -octet padding (must be all zeros)
- buckets of octets each.
Validation
- Matching signature
- Supported version (the rest of the rules are for this version)
- Hash function identifier must be recognized
- must be nonzero and a power of two
- must be nonzero
- Padding must be all zero
- File size must be octets
Lookup procedure
- Let be the unsigned integer encoded by the most significant bits of OID. is a power of two, and .
- Select and read bucket .
- For each :
- Start immediately after the most significant bits of OID, let the -th -bit field be the bits at offset through within the next bits of the OID.
- Let be the unsigned integer encoded by that -bit field. Then, .
- Compute , and . Thus, identifies one of the -bit words in bucket , and identifies one bit within that word.
- Test whether is set in the word of bucket . (Within each -bit word, bit index denotes the most significant bit, and bit index denotes the least significant bit.)
If any test fails, the OID is definitely not in the relevant idx. If all tests succeed, the OID may be in the relevant idx.
Note that two of the -bit fields can decode to the same , which means an insertion may set fewer than distinct bits.
Worked example
Let:
Then, . Each lookup thus uses bits to choose the bucket and bits to choose the in-bucket positions, for a total of bits taken from the object ID.
- Read the first bits of OID and interpret them as , where .
- Read bucket .
- For each :
- Read the -th -bit field from the next bits of OID and interpret it as , where .
- Compute: , .
- Test whether bit is set in the word of bucket .
Security considerations
An adversarial packfile where objects are (computationally intensive, even for SHA-1 as vulnerable as it is) constructed to have the same prefix for the relevant object format hash algorithm could be used to fill up the bloom filters, rendering some buckets useless. In the worst case, if they somehow fill all filters, this proposal’s optimizations become useless, but would not be a significant DoS vector.
TODOs
- How should B and K be chosen?
- How does creation/insert work? Note that packfiles and
.idxes are immutable. - What are the sizes?
- What are the false positive rates?
- Is there a way to make this SIMD friendly?
- How are benchmarks?