March 2018 Rational Exuberance

Counting the number of distinct elements within a set is a simple task. Yet it can be surprisingly difficult to do this efficiently at scale. That is because there is no obvious way to store intermediate results of a distinct count, except for actually storing the distinct values seen so far. This is expensive in two ways: it needs memory linear to the number of distinct values. And there is no trivial way to merge two partial intermediate results, so the calculation doesn't parallelize well.

The HyperLogLog algorithm is an approximation for the exact distinct count, with configurable accuracy. It has two main properties that make it very interesting: it only uses very little memory, and it can easily be parallelized. The latter implies that it also can be used to pre-aggregate intermediate results.

count(id)count(DISTINCT id)HyperLogLog(id)
TimeO(n)O(n)O(n)
MemoryO(1)O(n)O(1)
ParallelembarrasinglyNoembarrasingly
Error--typically ≈1-2%

So the HyperLogLog basically reduces the computational complexity of a distinct count back to a normal count, while adding a small error to the result. However it is not necessarily faster than a regular distinct count - in fact often the opposite can be true in practice, since the actual computational work increases quite a bit. Nevertheless a speedup is possible in the following cases:

• the implementation is heavily parallelized.
• the reduced memory usage prevents swapping.
• the intermediate results are used as pre-aggregation to speedup online queries.

The high-level idea of the algorithm is the following: input data is randomly but deterministically mapped to the values of a given random variable, according to its distribution. Only the most unlikely value generated with this process is stored in a register. The more of an outlier that value is, the more distinct elements there probably are. This is done at the same time for multiple registers with different random mappings, and their results are averaged to improve the accuracy.

```|------+------+------+------|
|   20 |   21 |   22 |   23 |
|------+------+------+------|
| 1111 | 0111 | 0011 | 0001 |
| 1110 | 0110 | 0010 |      |
| 1101 | 0101 |      |      |
| 1100 | 0100 |      |      |
| 1011 |      |      |      |
| 1010 |      |      |      |
| 1001 |      |      |      |
| 1000 |      |      |      |
|------+------+------+------|```
Above is a simplified example for one (small) register. Each value from the input data is first deterministically hashed to the domain `[20, 24)`. The likeliness for the hashed value to occur is then determined by the index of the leftmost 1 from the binary representation. So there is a 50% chance that the first digit will be a 1, a 25% chance that only the second digit will be a 1 and so on.
To estimate the number of distinct elements in a set, only the maximum of these indexes needs to be stored at any time. The more truly distinct values there are, the more likely a higher index is stored. If the maximum index is 1, the algorithm would estimate that there are 20 distinct elements, if the maximum index is 2, the estimation would be 21 and so on. Accuracy of the result can then be improved by increasing the number of registers, and averaging each of their estimations in a smart way.

Simulation of a HyperLogLog approximation for a randomly generated series of values between 0 and 127. The horizontal bar shows how many of the 128 possible values have already been drawn, while the 8 registers below show the state of the algorithm. Highlighted bars of each register are expected to move from the larger bars on the left to the smaller bar on the right the more values are drawn, to indicate that increasingly unlikely values are included. Note that the approximated value doesn't change if previously seen values are drawn, since the algorithm is deterministic. The order in which the random numbers are drawn does matter however, which is why the graphs can look different across multiple runs.

From that, the properties of the algorithm follow more or less directly. Instead of storing all distinct elements, the algorithm only needs to store a constant number of indexes. These can easily be calculated in parallel for different parts of the data, and then be merged by selecting the maximum index across all parts.

This can be done quite well in SQL: we first define the parameter `b`, which defines the accuracy of the result. We take the `alpha` parameter from HyperLogLog in Practice , and generate some random data to test the implementation:

``````WITH random_numbers AS (
SELECT (random() * (1 << 16)) :: INT AS n FROM generate_series(1, 1 << 14)
),
params AS (
SELECT
b,
2^b :: INT AS m,
CASE
WHEN b = 4 THEN 0.673
WHEN b = 5 THEN 0.697
WHEN b = 6 THEN 0.709
ELSE 0.7213 / (1.0 + (1.079 / 2^b))
END AS alpha,
1.04 / sqrt(2^b) AS expected_accuracy
FROM (SELECT 8 AS b) p
),
``````
Then we first map each input value to a random number `x` using the MD5 hash. The first `b` bits of that number are then used to determine the register for the input value, and the remaining bits are used to calculate how unlikely the random number was. For each register, we store the most unlikely value:
``````registers AS (
SELECT
(x & ((1 << b) - 1)) AS register,
max((62 - b) - floor(log(2, ((x & ((1 << 62) - 1)) >> b)))) :: INT AS p
FROM (
SELECT
('x' || substring(md5(n :: VARCHAR), 1, 16)) :: BIT(64) :: BIGINT AS x
FROM random_numbers
) t
CROSS JOIN params
GROUP BY register
),
``````
And finally obtain the estimated number of distinct elements by averaging results across registers:
``````SELECT
((alpha * m * m) / Z) :: BIGINT AS num_estimated_distinct_elements
FROM (
SELECT
sum(1.0 / (1 << p)) AS Z
FROM registers
) t
CROSS JOIN params
``````

The calculation here is pretty much embarrasingly parallel, so a good query planner can probably figure that out. One caveat however: in practice the data is probably not distributed by the registers.

The complete query is also on GitHub.

• Probabilistic algorithms can generally be a bit tricky in practice. Notably the parameter `b` also defines the maximum number of distinct values that can be estimated. Also the result accuracy gets worse if there actually aren't too many distinct values, which may be an issue.