## HyperLogLog in SQL

Counting the number of distinct elements in a set is a simple task. Yet doing it efficiently at scale can be surprisingly difficult, because there is no obvious way to store the intermediate result of a distinct count other than the set of distinct values seen so far. This is expensive in two ways: it requires memory linear in the number of distinct values, and there is no trivial way to merge two partial intermediate results, so the calculation doesn't parallelize well.
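To make the problem concrete, here is a minimal (illustrative) Python sketch of exact distinct counting: the only safe intermediate state is the full set of values seen so far, and merging two partial results means holding both full sets.

```python
def partial_distinct(values):
    # The intermediate state IS the set of distinct values:
    # memory grows linearly with the number of distinct elements.
    return set(values)

def merge(state_a, state_b):
    # Merging two partial results still requires both full sets.
    return state_a | state_b

left = partial_distinct([1, 2, 3, 2])
right = partial_distinct([3, 4, 5])
print(len(merge(left, right)))  # exact distinct count: 5
```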

The HyperLogLog algorithm^{[1]} approximates the exact distinct count, with configurable accuracy. It has two main properties that make it very interesting: it uses very little memory, and it can easily be parallelized. The latter implies that it can also be used to pre-aggregate intermediate results.

| | count(id) | count(DISTINCT id) | HyperLogLog(id) |
|---|---|---|---|
| Time | O(n) | O(n) | O(n) |
| Memory | O(1) | O(n) | O(1) |
| Parallel | embarrassingly | no | embarrassingly |
| Error | - | - | typically ≈1-2% |

So HyperLogLog essentially reduces the computational complexity of a distinct count back to that of a normal count, at the cost of a small error in the result. However, it is *not* necessarily faster than a regular distinct count; in practice the opposite is often true, since the actual computational work per row increases quite a bit. Nevertheless, a speedup is possible in the following cases:

- the implementation is heavily parallelized.
- the reduced memory usage prevents swapping.
- the intermediate results are used as pre-aggregation to speed up online queries.

**The high-level idea of the algorithm is the following:
input data is randomly but deterministically mapped to the
values of a given random variable, according to its
distribution. Only the most unlikely value generated with this
process is stored in a register. The more of an outlier that
value is, the more distinct elements there probably are. This
is done at the same time for multiple registers with different
random mappings, and their results are averaged to improve the
accuracy.**

Below is a simplified example for one (small) register. Each value from the input data is first deterministically hashed to the domain [2^0, 2^4):

| 2^0 | 2^1 | 2^2 | 2^3 |
|---|---|---|---|
| 1111 | 0111 | 0011 | 0001 |
| 1110 | 0110 | 0010 | |
| 1101 | 0101 | | |
| 1100 | 0100 | | |
| 1011 | | | |
| 1010 | | | |
| 1001 | | | |
| 1000 | | | |

The likelihood of the hashed value is then determined by the index of the leftmost 1 in its binary representation. So there is a 50% chance that the first digit will be a 1, a 25% chance that only the second digit will be a 1, and so on. To estimate the number of distinct elements in a set, only the maximum of these indexes needs to be stored at any time. The more truly distinct values there are, the more likely it is that a higher index has been stored. If the maximum index is 1, the algorithm would estimate that there are 2^0 distinct elements; if the maximum index is 2, the estimate would be 2^1, and so on. The accuracy of the result can then be improved by increasing the number of registers and averaging each of their estimations in a smart way.
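The single-register scheme can be sketched in a few lines of illustrative Python (a hypothetical sketch, not part of the article's query): hash each value, record the 1-based index of the leftmost set bit in a fixed-width window, and keep only the maximum.

```python
import hashlib

HASH_BITS = 16  # width of the (illustrative) hash window

def leftmost_one_index(value):
    # Deterministic "random" mapping: hash the value down to HASH_BITS bits.
    h = int(hashlib.md5(str(value).encode()).hexdigest(), 16) & ((1 << HASH_BITS) - 1)
    # 1-based position of the leftmost 1; ~50% chance it is 1, ~25% it is 2, ...
    return HASH_BITS - h.bit_length() + 1 if h else HASH_BITS

def estimate_distinct(values):
    # Only the maximum index ever needs to be stored.
    max_index = max(leftmost_one_index(v) for v in values)
    # A maximum index of k suggests roughly 2^(k-1) distinct elements.
    return 2 ** (max_index - 1)
```

A single register gives a very coarse estimate (always a power of two); the averaging over many registers described below is what makes the result usable.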


From that, the properties of the algorithm follow more or less directly. Instead of storing all distinct elements, the algorithm only needs to store a constant number of indexes. These can easily be calculated in parallel for different parts of the data, and then be merged by selecting the maximum index across all parts.
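Merging partial results is then just an elementwise maximum over the registers. A small illustrative sketch (register arrays of fixed length, one maximum index per register):

```python
def merge_registers(regs_a, regs_b):
    # Combining two partial states: keep the maximum index per register.
    # The operation is associative and commutative, so partial states
    # can be merged in any order (and in parallel).
    return [max(a, b) for a, b in zip(regs_a, regs_b)]

part_1 = [3, 0, 5, 1]  # registers computed from one chunk of the data
part_2 = [2, 4, 1, 1]  # registers computed from another chunk
print(merge_registers(part_1, part_2))  # [3, 4, 5, 1]
```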

This can be done quite well in SQL. We first define the parameter `b`, which controls the accuracy of the result, and derive the number of registers `m = 2^b` from it. The bias-correction parameter `alpha` is taken from the original paper^{[1]}:
```
WITH random_numbers AS (
  -- 2^14 random integers in [0, 2^16) as sample input (with duplicates)
  SELECT (random() * (1 << 16)) :: INT AS n FROM generate_series(1, 1 << 14)
),
params AS (
  SELECT
    b,
    (2 ^ b) :: INT AS m,                     -- number of registers
    CASE                                     -- bias correction from the paper
      WHEN b = 4 THEN 0.673
      WHEN b = 5 THEN 0.697
      WHEN b = 6 THEN 0.709
      ELSE 0.7213 / (1.0 + (1.079 / 2 ^ b))
    END AS alpha,
    1.04 / sqrt(2 ^ b) AS expected_accuracy  -- standard error ≈ 1.04 / sqrt(m)
  FROM (SELECT 8 AS b) p
),
```

Then each input value is mapped to a random number `x` using the MD5 hash. The first `b` bits of that number determine the register for the input value, and the remaining bits are used to calculate how unlikely the random number is. For each register, we store the most unlikely value seen:
```
registers AS (
  SELECT
    (x & ((1 << b) - 1)) AS register,  -- low b bits select the register
    -- 1-based index of the leftmost 1 in the remaining (62 - b) bits
    max((62 - b) - floor(log(2, ((x & ((1 :: BIGINT << 62) - 1)) >> b)))) :: INT AS p
  FROM (
    SELECT
      -- first 64 bits of the MD5 hash, interpreted as a BIGINT
      ('x' || substring(md5(n :: VARCHAR), 1, 16)) :: BIT(64) :: BIGINT AS x
    FROM random_numbers
  ) t
  CROSS JOIN params
  GROUP BY register
),
```
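To double-check the bit fiddling, the same register/rank extraction can be mirrored in Python (an illustrative sketch, not part of the query): the low `b` bits pick the register, and the remaining bits, with the top bits masked off as in the query, yield the rank.

```python
import hashlib

def register_and_rank(value, b=8):
    # First 16 hex digits of md5, read as a 64-bit integer -- mirrors
    # the ('x' || substring(md5(...), 1, 16)) :: BIT(64) :: BIGINT cast.
    x = int(hashlib.md5(str(value).encode()).hexdigest()[:16], 16)
    register = x & ((1 << b) - 1)   # low b bits select the register
    w = (x & ((1 << 62) - 1)) >> b  # remaining bits, top bits masked off
    # 1-based index of the leftmost 1 inside the (62 - b)-bit window,
    # i.e. (62 - b) - floor(log2(w)); w == 0 is treated as the maximum rank.
    rank = (62 - b) - w.bit_length() + 1 if w else 62 - b
    return register, rank
```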

And finally obtain the estimated number of distinct elements by
averaging results across registers:
```
SELECT
  -- bias-corrected harmonic mean across the registers
  ((alpha * m * m) / Z) :: BIGINT AS num_estimated_distinct_elements
FROM (
  SELECT
    sum(1.0 / (1 << p)) AS Z  -- Z = sum over registers of 2^(-p)
  FROM registers
) t
CROSS JOIN params
```
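Putting the pieces together outside the database, a compact illustrative Python version of the whole pipeline looks as follows. As in the query, only registers that were actually hit contribute to the sum, which is a fine approximation when there are far more values than registers:

```python
import hashlib

def hll_estimate(values, b=8):
    m = 2 ** b
    # alpha constants as in the params CTE
    alpha = {4: 0.673, 5: 0.697, 6: 0.709}.get(b, 0.7213 / (1.0 + 1.079 / m))
    registers = {}
    for v in values:
        x = int(hashlib.md5(str(v).encode()).hexdigest()[:16], 16)
        register = x & ((1 << b) - 1)
        w = (x & ((1 << 62) - 1)) >> b
        rank = (62 - b) - w.bit_length() + 1 if w else 62 - b
        registers[register] = max(registers.get(register, 0), rank)
    # Z = sum of 2^(-p) over the registers that were hit
    z = sum(1.0 / (1 << p) for p in registers.values())
    return int(alpha * m * m / z)

# With b = 8 the expected standard error is about 1.04 / sqrt(256) ≈ 6.5%.
print(hll_estimate(range(20000)))
```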

The calculation here is pretty much embarrassingly parallel, so a good query planner can probably figure that out. One caveat, however: in practice the data is probably not distributed by register.

The complete query is also on GitHub.

Some additional notes:

- The implementation and description here aren't the most efficient, accurate, or correct ones; they are mostly meant to build some intuition about the algorithm. Read the original paper^{[1]} for the details.
- The standard HyperLogLog algorithm can, with reasonable parameters, achieve a result accuracy of 1-2%. There are ways to improve this accuracy even further, but if there is enough data to justify an approximated distinct count, how much error is already in the data anyway?
- Probabilistic algorithms can generally be a bit tricky in practice. Notably, the parameter `b` also defines the maximum number of distinct values that can be estimated. The result accuracy also gets worse if there actually aren't many distinct values, which may be an issue.
- HyperLogLog is meant to use very little memory, but don't underestimate the additional compute resources it requires, or the storage requirements if it is used as an aggregate table.
- While the general idea of the algorithm is somewhat intuitive, the averaging required to achieve the stated accuracy requires a fair bit of advanced ~~magic~~ math. But it seems to work quite well :)

References: