On (not) using “cryptographic hashes” for hash table keys

A hash table maps keys to buckets (or, sometimes, more complex schemes such as bucket directories and sizes, as seen in extendible hashing, probe chains in open addressing, etc.).

In every case there is a function $h : \text{keys} \to \{0, 1\}^*$ (or some prefix or reduction thereof) whose output the scheme consults to decide where a key lives.

What must $h$ be?

The naïve answer of “something that spreads key evenly” may be catestrophically wrong under certain adversarial conductions. In particular, if for some reason (such as for content-addressing) a deterministic public hash such as SHA-256 already exists for the key, it is very tempting but also very likely incorrect to index hash tables with it directly, with potentially adversarial inputs.

Hash function properties

(There’s more such as immunity to length extension, which are irrelevant here but necessary elsewhere.)

Distribution/uniformity: For “typical” non-adversarial inputs drawn from some natural distribution, outputs are spread roughly evenly across the codomain.
Avalanche: A one-bit change in input is expected to flip roughly half the output bits.
Collision resistance: Finding $x \ne x'$ with $h(x) = h(x')$ is computationally infeasible.
Preimage and second-preimage resistance: Given an output $y$ or a specific input $x$ , finding $x'$ with $h(x') = y$ (or $h(x') = h(x), x' \ne x$ ) is computationally infeasible.
PRF security: A keyed function $F_k(x)$ such that an adversary lacking $k$ cannot distinguish $F_k$ from a uniformly random function $f : keys \to \{0, 1\}^n$ , even when given the ability to query it at inputs of their choice.

The temptation is to assume that “more cryptographic” implies “more secure for any purpose.”

What does a hash table need?

For honest workloads, a hash table needs inputs and the hash function to interact such that whatever location-space the scheme uses is occupied evenly. Distribution and avalanche suffice.

For adversarial workloads, which is to say, any case where an adversary can supply inputs to the table, each hsah table scheme has a degenerate regime that an adversary would like to force for a DoS:

A chained table degrades when many keys share a bucket, lengthening a chain to $\Theta(n)$ .
An open-addressed table degrades when many keys’ probe sequences collide on the same slots.
An extendible hash table degrades when many keys share a long prefix, such that the directory doubles exponentially until the global depth reaches the prefix’s length.
And so on.

In each, the adversary’s task is to find an input set $S$ such that $h(x) : x \in S$ exhibits some specific non-uniformity that causes the hash table scheme to degrade. The developer’s task is thus to make constructing such an $S$ to be infeasible.

What is thus necessary, is that outputs are unpredictable to the adversary; this excludes ordinary cryptographic hash functions such as SHA-256. Unpreductability requires that the adversary lack some piece of information that the defender has, which is to say, some sort of secret key. With no key, no public function, however “cryptographically strong” in the ordinary sense, can be unpredictable to the adversary.

Scenario

Consider a service that receives Git objects, computes their SHA-256, and uses the SHA-256 directly to index a hash table, say, an extendible hash table keyed by the digest’s leading bits. If the adversary wants $10^6$ inputs whose SHA-256 hashes share a 20-bit prefix, they sample roughly $10^9$ public Git objects, and filter. Submitting these to the service causes the extendible hash directory to global depth $\ge 20$ and size $\ge 2^{20}$ . Similarly goes for other hash table schemes.

A possible solution would let the hash table scheme’s input be $\text{SipHash}_k(\text{SHA-256}(x))$ , where $k$ is a secret unknown to the adversary. Hashing the digest instead of the original object is fine since SHA-256 is injective in practice over the inputs we care about; in fact it would be necessary in schemes where you would otherwise (insecurely) key by the object ID itself, since you most definitely would like to query objects by object ID.

SipHash’s lack of collision resistance

SipHash’s 64-bit output makes finding a collision feasible in roughly $2^{32}$ work via the birthday bound, which would be bad for content addressing, message authentication, or digital signatures, but is irrelevant for a hash table, since collisions, prefix collisions, etc., are all events that the table must already support at some cost, and what matters is that an adversary cannot deliberately target such events without the key.