Reducing ML-KEM-768 encapsulation key sizes by 24 octets

Disclaimer: I am not a cryptographer. There may be serious bugs or side channels!

The minimum MTU for IPv6 is 1280 octets; if you subtract the 40-octet IPv6 header and the 8-octet UDP header, you get 1232 usable octets. ML-KEM-768 public keys are 1184 octets long, leaving little space for other protocol material that you might want to stuff into a single UDP packet.

Can you compress ML-KEM-768 public keys?

Note that, in the standard ML-KEM-768 public key layout, octets $\left[0, 1152\right)$ are three polynomials, each with 256 coefficients, each 12 bits. Octets $\left[1152, 1184\right)$ are the seed $\rho$ which is considered to be indistinguishable from random.

But wait. $q = 3329$ , and $\log_2 q \approx 11.70087316$ . Yet we are storing each coefficient in 12 bits, wasting about 0.3 bits each. Let’s try to compress it, even if it doesn’t save much.

The construction we will propose, has been described in eprint 2016/461, “NTRU Prime: reducing attack surface at low cost” by Bernstein, Chuengsatiansup, Lange, and Vredendaal. Now, NTRU Prime is not ML-KEM, but the same compression applies regardless. To quote section 2.2,

The encoding of public keys as strings is another parameter for Streamlined NTRU Prime. Each element of $\mathbb{Z}/q$ is traditionally encoded as $\lceil \log_2 q \rceil$ bits, so the public key is traditionally encoded as $p \lceil \log_2 q \rceil$ bits. If $q$ is noticeably smaller than a power of 2 then one can easily compress a public key by merging adjacent elements of $\mathbb{Z}/q$ , with a lower limit of $p \lceil \log_2 q \rceil$ . For example, 5 elements of $\mathbb{Z}/q$ for $q = 4591$ are easily encoded together as 8 octets, saving 1.5% compared to separately encoding each element as 13 bits, and 20% compared to separately encoding each element as 2 octets.

The maximum that could be saved with this technique for ML-KEM-768 is 28 octets by using groups of 24 instead of 4, but is significantly more difficult to implement correctly. Let’s do groups of 4 coefficients.

Let’s encode each group of 4 coefficeints $(c_0, c_1, c_2, c_3)$ as the integer $c_3\cdot q^3 + c_2\cdot q^2 + c_1\cdot q + c_0 \in [0, q⁴)$ , requiring ⌈4·log₂(q)⌉ = 47 bits.

// Pack a standard 1184-octet encapsulation key
// into a 1160-octet packed representation.
//
// Returns [[errors::invalid]] if any coefficient is ≥ q.
export fn ek768_pack(
	ek: *[EK768_SZ]u8,
) ([PACKED_EK768_SZ]u8 | errors::invalid) = {
	let out: [PACKED_EK768_SZ]u8 = [0...];

	let buf: u64 = 0;
	let buf_len: u8 = 0;
	let out_idx = 0z;

	for (let g = 0z; g < n_groups; g += 1) {
		// Decode 4 coefs from consecutive 12-bit pairs
		// g: 4g..4g+3, from 6
		const base = g * (4 * 12 / 8);

		const d0: u32 = (ek[base + 0]: u32)
			| (ek[base + 1]: u32) << 8
			| (ek[base + 2]: u32) << 16;
		const c0 = field_check_reduced((d0 & 0xFFF): u16)?;
		const c1 = field_check_reduced((d0 >> 12): u16)?;

		const d1: u32 = (ek[base + 3]: u32)
			| (ek[base + 4]: u32) << 8
			| (ek[base + 5]: u32) << 16;
		const c2 = field_check_reduced((d1 & 0xFFF): u16)?;
		const c3 = field_check_reduced((d1 >> 12): u16)?;

		// Horner
		let v: u64 = c3;
		v = v * q + c2;
		v = v * q + c1;
		v = v * q + c0;

		// Accum 47 bits
		// buf_len ≤ 7 so buf_len + 47 ≤ 54 < 64.
		buf |= v << buf_len;
		buf_len += bits_per_grp;
		for (buf_len >= 8) {
			out[out_idx] = (buf & 0xFF): u8;
			out_idx += 1;
			buf >>= 8;
			buf_len -= 8;
		};
	};
	// 192 × 47 = 9024 = 1128 × 8

	// ρ
	out[packed_poly_sz..] = ek[EK768_SZ - 32..];
	return out;
};

// Unpack a 1160-octet packed representation
// back to a standard 1184-octet encapsulation key.
//
// Returns [[errors::invalid]] if any group value ≥ q⁴.
export fn ek768_unpack(
	packed: *[PACKED_EK768_SZ]u8,
) ([EK768_SZ]u8 | errors::invalid) = {
	let ek: [EK768_SZ]u8 = [0...];

	let buf: u64 = 0;
	let buf_len: u8 = 0;
	let in_idx = 0z;
	const mask: u64 = (1u64 << bits_per_grp) - 1;

	for (let g = 0z; g < n_groups; g += 1) {
		for (buf_len < bits_per_grp) {
			buf |= (packed[in_idx]: u64) << buf_len;
			in_idx += 1;
			buf_len += 8;
		};
		const v = buf & mask;
		buf >>= bits_per_grp;
		buf_len -= bits_per_grp;

		if (v >= q_to_the_4th)
			return errors::invalid;

		// The public key is public,
		// so this need not be constant-time.
		//
		// All < q because v < q⁴.
		let r = v;
		const c0: u16 = (r % q): u16; r /= q;
		const c1: u16 = (r % q): u16; r /= q;
		const c2: u16 = (r % q): u16; r /= q;
		const c3: u16 = r: u16;

		// Make them two 12-bit pairs again.
		const base = g * 6;
		const e0: u32 = (c0: u32) | (c1: u32) << 12;
		ek[base + 0] = e0: u8;
		ek[base + 1] = (e0 >> 8): u8;
		ek[base + 2] = (e0 >> 16): u8;

		const e1: u32 = (c2: u32) | (c3: u32) << 12;
		ek[base + 3] = e1: u8;
		ek[base + 4] = (e1 >> 8): u8;
		ek[base + 5] = (e1 >> 16): u8;
	};

	// ρ
	ek[EK768_SZ - 32..] = packed[packed_poly_sz..];
	return ek;
};

See hare-xcrypto’s mlkem768pack for an implementation in Hare with no security guarantees.