If you are checking a lot of keys, and you don’t expect that many of them actually are compromised, hitting the pwnedkeys API for every single key can be a bit of a buzzkill. To make your life easier, we provide a bloom filter dataset for commercial users of pwnedkeys.com, which you can use to answer the question “is this key probably known to be compromised?”
One properly of a bloom filter is that it can tell you that a key is definitely not in the set of known-compromised keys, but it can only say that the key is probably in the set. So if the bloom filter comes back “positive”, you still need to run that key against the API, both to get a definitive “yes, this is compromised” answer, and also to collect the verifiable compromise attestation. However, assuming that you are testing keys at random, only having to hit the pwnedkeys API when the filter comes back positive should be a significant reduction in traffic.
This page describes the file format used to store the bloom filter data provided by pwnedkeys.com, and also describes how to query the filter.
The filter file consists of two parts:
a header containing an identity marker, version and creation time information, entry count, and the filter parameters; and
The filter data itself.
The first 24 bytes of the filter data file comprise the file header, which contains the filter metadata.
The fields are as follows. All integers are in network (big endian) byte order.
|0||6||String||File format marker. This will always be the string pkbfv1, which corresponds to the hex bytes 70 6B 62 66 76 31.|
|6||4||32-bit unsigned integer||File revision counter. The revision counter is used when obtaining incremental updates to the filter data file. See the section "Updating the data file", below, for more information.|
|10||8||64-bit unsigned integer||Last update time. Provides an indication of when this filter file's data was last updated. Represents the number of seconds elapsed since the UNIX epoch (1970-01-01 00:00:00 UTC).|
|18||4||32-bit unsigned integer||Entry count. Indicates how many keys are represented in the filter data. This can be used, along with the filter parameters, to estimate the false-positive rate likely to exist within the data.|
|22||1||8-bit unsigned integer||Hash count. Specifies the k parameter of the filter, which is the number of bits which are set for each entry in the filter.|
|23||1||8-bit unsigned integer||Hash length. Specifies the length of each value in the hash, in bits. The size of the lookup table, in bits, (known as m) is thus 2(hash length). This can also be used to find the size of the bloom filter data on disk (or in memory) in bytes, by 2(hash length) / 8.|
|24||variable||bit field||Bloom filter data. It is used as a bit array.|
Querying the bloom filter
To answer the question, “is this key probably known to be compromised?”, you need to determine the bloom filter bit positions for the SPKI fingerprint of the key you’re testing, and then test whether all those bit positions are set, using the following algorithm:
Extract or construct the SPKI data structure of the key you’re testing, and convert that into a DER-encoded binary string, which will will refer to as
0) – that is, apply the XXH64 hash function over the DER-encoded SPKI, with a hash seed value of
0. We will refer to this value as h1.
1). If this value is even, it must be incremented by one, in order to make it odd. We will refer to this value as h2.
Calculate a total of
knumbers, representing bits in the bloom filter to test, using the following formula:
fi = (h1 + i * h2 + (i3 - i) / 6) mod m
iis in the range
k - 1.
For each of the
knumbers calculated, which are each in the range
m - 1, check whether the bit at that position in the bloom filter data is set. If any relevant bit position is unset (equal to
0), then the key you are testing definitely does not exist in the filter, and the algorithm can be terminated with the result “not known to be compromised”.
If all bits at the positions calculated are set (equal to
1), then the key being tested probably exists in the pwnedkeys database. You should perform an online lookup using the pwnedkeys API to be sure, and also to retrieve a cryptographic proof of compromise.
The reference implementation of the filter querying code is the pwnedkeys-filter ruby gem.
If you have an open source implementation you would like to have listed here, please get in touch.
For testing purposes, the following bloom filter data files are available. They each contain the canonical pwnedkeys example keys only. There are a number of different combinations of hash count and hash length, so you can validate the operation of your lookup code in different conditions.
- 2 hashes, 4 bit hash length
- 3 hashes, 6 bit hash length
- 5 hashes, 12 bit hash length
- 12 hashes, 18 bit hash length
The algorithm for calculating the bit positions to test comes from section 5.2 of Bloom Filters in Probabilistic Verification, by Peter C. Dilinger and Panagiotis Manolios.