HyperLogLog in SQL
Counting the number of distinct elements within a set is a simple task. Yet it can be surprisingly difficult to do this efficiently at scale. That is because there is no obvious way to store intermediate results of a distinct count, except for actually storing the distinct values seen so far. This is expensive in two ways: it needs memory linear to the number of distinct values. And there is no trivial way to merge two partial intermediate results, so the calculation doesn't parallelize well.
The HyperLogLog algorithm is an approximation for the exact distinct count, with configurable accuracy. It has two main properties that make it very interesting: it only uses very little memory, and it can easily be parallelized. The latter implies that it also can be used to pre-aggregate intermediate results.
So the HyperLogLog basically reduces the computational complexity of a distinct count back to a normal count, while adding a small error to the result. However it is not necessarily faster than a regular distinct count - in fact often the opposite can be true in practice, since the actual computational work increases quite a bit. Nevertheless a speedup is possible in the following cases:
- the implementation is heavily parallelized.
- the reduced memory usage prevents swapping.
- the intermediate results are used as pre-aggregation to speedup online queries.
The high-level idea of the algorithm is the following: input data is randomly but deterministically mapped to the values of a given random variable, according to its distribution. Only the most unlikely value generated with this process is stored in a register. The more of an outlier that value is, the more distinct elements there probably are. This is done at the same time for multiple registers with different random mappings, and their results are averaged to improve the accuracy.
|------+------+------+------| | 20 | 21 | 22 | 23 | |------+------+------+------| | 1111 | 0111 | 0011 | 0001 | | 1110 | 0110 | 0010 | | | 1101 | 0101 | | | | 1100 | 0100 | | | | 1011 | | | | | 1010 | | | | | 1001 | | | | | 1000 | | | | |------+------+------+------|Above is a simplified example for one (small) register. Each value from the input data is first deterministically hashed to the domain
[20, 24). The likeliness for the hashed value to occur is then determined by the index of the leftmost 1 from the binary representation. So there is a 50% chance that the first digit will be a 1, a 25% chance that only the second digit will be a 1 and so on.
To estimate the number of distinct elements in a set, only the maximum of these indexes needs to be stored at any time. The more truly distinct values there are, the more likely a higher index is stored. If the maximum index is 1, the algorithm would estimate that there are 20 distinct elements, if the maximum index is 2, the estimation would be 21 and so on. Accuracy of the result can then be improved by increasing the number of registers, and averaging each of their estimations in a smart way.
From that, the properties of the algorithm follow more or less directly. Instead of storing all distinct elements, the algorithm only needs to store a constant number of indexes. These can easily be calculated in parallel for different parts of the data, and then be merged by selecting the maximum index across all parts.
This can be done quite well in SQL: we first define the
b, which defines the accuracy of the
result. We take the
alpha parameter from
Then we first map each input value to a random number
WITH random_numbers AS ( SELECT (random() * (1 << 16)) :: INT AS n FROM generate_series(1, 1 << 14) ), params AS ( SELECT b, 2^b :: INT AS m, CASE WHEN b = 4 THEN 0.673 WHEN b = 5 THEN 0.697 WHEN b = 6 THEN 0.709 ELSE 0.7213 / (1.0 + (1.079 / 2^b)) END AS alpha, 1.04 / sqrt(2^b) AS expected_accuracy FROM (SELECT 8 AS b) p ),
xusing the MD5 hash. The first
bbits of that number are then used to determine the register for the input value, and the remaining bits are used to calculate how unlikely the random number was. For each register, we store the most unlikely value:
And finally obtain the estimated number of distinct elements by averaging results across registers:
registers AS ( SELECT (x & ((1 << b) - 1)) AS register, max((62 - b) - floor(log(2, ((x & ((1 << 62) - 1)) >> b)))) :: INT AS p FROM ( SELECT ('x' || substring(md5(n :: VARCHAR), 1, 16)) :: BIT(64) :: BIGINT AS x FROM random_numbers ) t CROSS JOIN params GROUP BY register ),
SELECT ((alpha * m * m) / Z) :: BIGINT AS num_estimated_distinct_elements FROM ( SELECT sum(1.0 / (1 << p)) AS Z FROM registers ) t CROSS JOIN params
The calculation here is pretty much embarrasingly parallel, so a good query planner can probably figure that out. One caveat however: in practice the data is probably not distributed by the registers.
The complete query is also on GitHub.
Some additional notes:
- The implementation and description here isn't really the most efficient/accurate/correct one and mostly meant to gain some intution about the algorithm. Read the original paper for the details.
- The standard HyperLogLog algorithm can with reasonable parameters achieve a result accuracy of 1-2%. There are some ways to improve this accuracy even further, but if there is enough data to justify an approximated distinct count, how much error is already in the data anyway?
- Probabilistic algorithms can generally be a bit tricky
in practice. Notably the parameter
balso defines the maximum number of distinct values that can be estimated. Also the result accuracy gets worse if there actually aren't too many distinct values, which may be an issue.
- The HyperLogLog is meant to use very little memory, but don't underestimate the additional compute resources it requires, or the storage requirements if used as an aggregate table.
- While the general idea of the algorithm is somewhat
intuitive, the averaging required to achieve the stated
accuracy requires a fair bit of
magicmath. But it seems to work quite well :)