HyperLogLog in SQL
Counting the number of distinct elements in a set is a simple task. Yet it can be surprisingly difficult to do efficiently at scale. That is because there is no obvious way to store intermediate results of a distinct count, other than actually storing all distinct values seen so far. This is expensive in two ways: it needs memory linear in the number of distinct values, and there is no trivial way to merge two partial intermediate results, so the calculation doesn't parallelize well.
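To see why partial results don't merge, consider two partitions of the same dataset, sketched here in Python:

```python
# Two partitions of the same dataset, possibly sharing values.
part_a = [1, 2, 3, 4]
part_b = [3, 4, 5]

# Per-partition distinct counts cannot simply be summed,
# because the partitions may overlap:
assert len(set(part_a)) + len(set(part_b)) == 7  # but the true answer is 5

# The only exact way to merge is to keep the full sets around,
# which needs memory linear in the number of distinct values:
assert len(set(part_a) | set(part_b)) == 5
```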
The HyperLogLog algorithm[1] is an approximation for the exact distinct count, with configurable accuracy. It has two main properties that make it very interesting: it only uses very little memory, and it can easily be parallelized. The latter implies that it also can be used to pre-aggregate intermediate results.
| | count(id) | count(DISTINCT id) | HyperLogLog(id) |
|---|---|---|---|
| Time | O(n) | O(n) | O(n) |
| Memory | O(1) | O(n) | O(1) |
| Parallel | embarrassingly | no | embarrassingly |
| Error | - | - | typically ≈1-2% |
So HyperLogLog essentially reduces the computational complexity of a distinct count back to that of a normal count, at the cost of a small error in the result. However, it is not necessarily faster than a regular distinct count - in practice the opposite is often true, since the actual computational work per element increases quite a bit. Nevertheless, a speedup is possible in the following cases:
- the implementation is heavily parallelized.
- the reduced memory usage prevents swapping.
- the intermediate results are used as pre-aggregation to speedup online queries.
The high-level idea of the algorithm is the following: input data is randomly but deterministically mapped to the values of a given random variable, according to its distribution. Only the most unlikely value generated with this process is stored in a register. The more of an outlier that value is, the more distinct elements there probably are. This is done at the same time for multiple registers with different random mappings, and their results are averaged to improve the accuracy.
| 2^0 | 2^1 | 2^2 | 2^3 |
|------|------|------|------|
| 1111 | 0111 | 0011 | 0001 |
| 1110 | 0110 | 0010 | |
| 1101 | 0101 | | |
| 1100 | 0100 | | |
| 1011 | | | |
| 1010 | | | |
| 1001 | | | |
| 1000 | | | |

Above is a simplified example for one (small) register. Each value from the input data is first deterministically hashed to the domain [2^0, 2^4). The likelihood of the hashed value is then determined by the index of the leftmost 1 in its binary representation: there is a 50% chance that the first digit is a 1, a 25% chance that only the second digit is a 1, and so on. To estimate the number of distinct elements in a set, only the maximum of these indexes needs to be stored at any time. The more truly distinct values there are, the more likely it is that a high index gets stored. If the maximum index is 1, the algorithm estimates that there are 2^0 distinct elements; if the maximum index is 2, the estimate is 2^1, and so on. The accuracy of the result can then be improved by increasing the number of registers and averaging their estimations in a smart way.
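The register logic from the example can be sketched in Python (an illustrative helper, not the SQL implementation below):

```python
def rank(x: int, bits: int = 4) -> int:
    """Index of the leftmost 1 in the `bits`-wide binary representation of x."""
    assert 0 < x < (1 << bits)
    return bits - x.bit_length() + 1

# 1xxx has rank 1 (50% of values), 01xx rank 2 (25%), and so on.
assert rank(0b1111) == 1
assert rank(0b0101) == 2
assert rank(0b0011) == 3
assert rank(0b0001) == 4

# The register only keeps the maximum rank seen so far;
# a maximum rank r suggests roughly 2^(r-1) distinct elements.
register = 0
for value in [0b1010, 0b0110, 0b1001]:
    register = max(register, rank(value))
assert register == 2  # estimate: 2^(2-1) = 2
```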
From that, the properties of the algorithm follow more or less directly. Instead of storing all distinct elements, the algorithm only needs to store a constant number of indexes. These can easily be calculated in parallel for different parts of the data, and then be merged by selecting the maximum index across all parts.
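Merging two partial results then amounts to an elementwise maximum over the registers, sketched here in Python:

```python
# Register states from two independently processed parts of the data.
registers_a = [3, 1, 4, 2]
registers_b = [2, 5, 1, 2]

# Merging loses no information: the combined state is exactly what
# a single pass over the concatenated data would have produced.
merged = [max(a, b) for a, b in zip(registers_a, registers_b)]
assert merged == [3, 5, 4, 2]
```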
This can be done quite well in SQL. We first define the parameter b, which determines the accuracy of the result, and take the alpha bias-correction constant from the original paper[1]:
WITH random_numbers AS (
SELECT (random() * (1 << 16)) :: INT AS n FROM generate_series(1, 1 << 14)
),
params AS (
SELECT
b,
    (2 ^ b) :: INT AS m,
CASE
WHEN b = 4 THEN 0.673
WHEN b = 5 THEN 0.697
WHEN b = 6 THEN 0.709
ELSE 0.7213 / (1.0 + (1.079 / 2^b))
END AS alpha,
1.04 / sqrt(2^b) AS expected_accuracy
FROM (SELECT 8 AS b) p
),
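For b = 8 those parameters work out as follows (a Python cross-check of the CASE expression above):

```python
import math

b = 8
m = 2 ** b  # number of registers
alpha = 0.7213 / (1.0 + 1.079 / m)  # bias correction for b > 6
expected_accuracy = 1.04 / math.sqrt(m)

assert m == 256
assert round(alpha, 4) == 0.7183
assert round(expected_accuracy, 3) == 0.065  # about 6.5% standard error
```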
Then each input value is mapped to a random number x using the MD5 hash. The lowest b bits of that number are used to determine the register for the input value, and the remaining bits are used to calculate how unlikely the random number was. For each register, we store the most unlikely value:
registers AS (
SELECT
(x & ((1 << b) - 1)) AS register,
    max((62 - b) - floor(log(2, ((x & ((1 :: BIGINT << 62) - 1)) >> b)))) :: INT AS p
FROM (
SELECT
('x' || substring(md5(n :: VARCHAR), 1, 16)) :: BIT(64) :: BIGINT AS x
FROM random_numbers
) t
CROSS JOIN params
GROUP BY register
),
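The same hashing step can be mirrored in Python to sanity-check the bit twiddling (using hashlib's MD5; the helper below is illustrative and mirrors the query above):

```python
import hashlib

def register_and_rank(value: str, b: int = 8):
    # First 16 hex characters of the MD5 digest, as a 64-bit integer.
    x = int(hashlib.md5(value.encode()).hexdigest()[:16], 16)
    # The lowest b bits pick the register.
    register = x & ((1 << b) - 1)
    # The remaining bits (up to bit 61, as in the query) determine the rank;
    # floor(log2(r)) equals r.bit_length() - 1 for r > 0.
    remaining = (x & ((1 << 62) - 1)) >> b
    p = (62 - b) - remaining.bit_length() + 1
    return register, p

reg, p = register_and_rank("42")
assert 0 <= reg < 2 ** 8
assert p >= 1
```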
And finally we obtain the estimated number of distinct elements by averaging the results across registers:
SELECT
((alpha * m * m) / Z) :: BIGINT AS num_estimated_distinct_elements
FROM (
SELECT
    sum(1.0 / (1 :: BIGINT << p)) AS Z
FROM registers
) t
CROSS JOIN params
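An end-to-end simulation of the same pipeline in Python shows the estimate landing near the true count (illustrative sketch with the same b = 8; unlike the query, it also counts empty registers in Z):

```python
import hashlib

def hll_estimate(values, b: int = 8) -> int:
    m = 2 ** b
    alpha = 0.7213 / (1.0 + 1.079 / m)
    registers = [0] * m
    for v in values:
        # Same hashing and bit twiddling as in the SQL query.
        x = int(hashlib.md5(str(v).encode()).hexdigest()[:16], 16)
        register = x & ((1 << b) - 1)
        remaining = (x & ((1 << 62) - 1)) >> b
        p = (62 - b) - remaining.bit_length() + 1
        registers[register] = max(registers[register], p)
    # Harmonic mean across registers; empty registers contribute 2^0 = 1.
    z = sum(2.0 ** -p for p in registers)
    return int(alpha * m * m / z)

true_count = 10_000
estimate = hll_estimate(range(true_count))
# Expected standard error is about 6.5% for b = 8; loose bound for safety.
assert abs(estimate - true_count) / true_count < 0.2
```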
The calculation here is pretty much embarrassingly parallel, so a good query planner can probably figure that out. One caveat, however: in practice the data is probably not already distributed by register.
The complete query is also on GitHub.
Some additional notes:
- The implementation and description here aren't really the most efficient/accurate/correct ones and are mostly meant to build some intuition about the algorithm. Read the original paper[1] for the details.
- The standard HyperLogLog algorithm can with reasonable parameters achieve a result accuracy of 1-2%. There are some ways to improve this accuracy even further, but if there is enough data to justify an approximated distinct count, how much error is already in the data anyway?
- Probabilistic algorithms can generally be a bit tricky in practice. Notably, the parameter b also defines the maximum number of distinct values that can be estimated. The result accuracy also gets worse if there actually aren't many distinct values, which may be an issue.
- The HyperLogLog is meant to use very little memory, but don't underestimate the additional compute resources it requires, or the storage requirements if used as an aggregate table.
- While the general idea of the algorithm is somewhat intuitive, the averaging required to achieve the stated accuracy requires a fair bit of advanced ~~magic~~ math. But it seems to work quite well :)
References:
[1] Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier: "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm", Analysis of Algorithms (AofA) 2007.