
A short overview of JOIN algorithms

In SQL, the JOIN[1] is arguably the most important operator for combining data from different tables. Logically there are different join types, e.g. LEFT/RIGHT, INNER/OUTER and CROSS JOINs, which specify the result of the operator exactly. Yet since SQL is a declarative language, they do not specify how the result is obtained.

There has been a lot of research on how to improve join implementations[2], and writing an implementation that performs competitively on modern hardware is far from trivial[3]. Many optimizations for specific subsets of the generic join operation exist as well. Most common implementations are variants of three basic algorithms though: the nested loop join, the hash join and the merge join. All of them can be used to implement any of the different join types.

Many current databases implement all three of these algorithms and let the query planner decide which one to use for each join in a given query. As a user, you ideally do not need to care which of them is used, since they all produce the same result. However, depending on the data in both tables and on the join keys, each of the algorithms can show very different performance patterns. Understanding the performance of the join algorithm chosen by the query planner is therefore often key to understanding the performance of the whole query.

We'll first look at the simplest algorithm, the nested loop join. And we'll further simplify by restricting the implementation to an inner join of in-memory integer arrays without duplicates, testing the join key for equality. Essentially this means that we look at a function which takes two arrays of integers as parameters and returns an array of the integers that occur in both input arrays. Except that we'll actually use Python lists instead of arrays.

import random

x = random.sample(range(40), 20)
y = random.sample(range(40), 10)

def simplified_nested_loop_join(x, y):
    result = []
    # Compare every value from x with every value from y.
    for a in x:
        for b in y:
            if a == b:
                result.append(a)
    return result

print('Input:'), print(x), print(y)
result = simplified_nested_loop_join(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[3, 28, 15, 19, 6, 8]

The implementation is not very complicated and doesn't need a lot of additional memory, but it has a huge disadvantage: the runtime is, independent of the data in the tables, always quadratic (more specifically: O(|x| × |y|)). Since in practice most joins can be computed much faster, the nested loop join is often not a good choice. Additionally, the input tables are not necessarily kept in memory as in the simple example here, which could then result in repeated (slow) disk accesses for the table in the inner loop.

Due to its simplicity, the nested loop join can nevertheless still be helpful in some situations. For example, if the join condition matches nearly all of the rows (i.e. the join is almost equivalent to a cross join), the runtime is effectively quadratic anyway and the nested loop join may have the least overhead of all implementations. Or if one of the tables is very small, a good implementation would choose the small table for the inner loop. On many architectures this can lead to very efficient machine code, where the small table is kept in a very fast CPU cache with few cache misses (implicitly emulating a hash join).
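
Since all three basic algorithms can implement any of the join types, here is a minimal sketch of how the nested loop join above could be extended to a simplified LEFT OUTER join; the function name is hypothetical, and None stands in for SQL's NULL:

def simplified_nested_loop_left_outer_join(x, y):
    result = []
    for a in x:
        matched = False
        for b in y:
            if a == b:
                result.append((a, b))
                matched = True
        # LEFT OUTER join: keep values from x without a match in y,
        # padded with None (the equivalent of NULL in SQL).
        if not matched:
            result.append((a, None))
    return result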

If two relatively large tables with only a few matches per row are joined though, the hash join may be a better option. As indicated by the name, it first creates a hash table as an index structure for lookups in one of the tables. It then iterates over the other table and uses the hash table to determine for each row if the join condition matches.

def simplified_hash_join(x, y):
    result = []

    # Build phase: index one of the tables with a hash table.
    hash_table = {}
    for a in x:
        hash_table[a] = True

    # Probe phase: look up every value of the other table.
    for b in y:
        if b in hash_table:
            result.append(b)

    return result

print('Input:'), print(x), print(y)
result = simplified_hash_join(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[6, 3, 28, 15, 8, 19]

If, as in the example here, the join keys are unique and the hash table avoids too many collisions, the runtime of the hash join is linear (more specifically: O(|x| + |y|)), which is quite an improvement compared to the nested loop join. This comes at a certain cost however: memory usage increases, since the additional hash table is needed. An actual implementation would likely try to reduce memory usage by checking which of the input tables is smaller, and then build the hash table for that smaller table (where smaller may mean either the fewest unique join key values or the fewest rows).

But even then the size of the smaller table can still exceed the available memory. A slightly more complicated implementation, as in the example here, could therefore reduce memory usage further by partitioning both tables with the same hash function before the join, and then running the hash join independently for each partition. Finally, the results from all partitions are concatenated to obtain the actual result of the join.

MEMORY_LIMIT, MAX_RECURSION_LIMIT = 8, 3

def simplified_hash_join_with_memory_limit(x, y, level=1):
    if level > MAX_RECURSION_LIMIT:
        raise Exception('Cannot compute JOIN due to memory constraints!')

    smaller_table = x if len(x) <= len(y) else y
    larger_table = x if len(x) > len(y) else y

    # If even the smaller table does not fit into memory, partition both
    # tables with the same hash function and join each partition separately.
    if len(smaller_table) > MEMORY_LIMIT:
        results = []
        partitions = len(smaller_table) // MEMORY_LIMIT + 1
        for partition in range(partitions):
            # The factor `level` varies the hash function on each recursion
            # level, so repartitioning has a chance to split the data.
            results.extend(simplified_hash_join_with_memory_limit(
                list(filter(lambda a: (level * a) % partitions == partition, x)),
                list(filter(lambda b: (level * b) % partitions == partition, y)),
                level=level+1
            ))
        return results

    # Otherwise run a regular hash join, building the hash table
    # for the smaller table.
    result = []
    hash_table = {a: True for a in smaller_table}
    for b in larger_table:
        if b in hash_table:
            result.append(b)
    return result

print('Input:'), print(x), print(y)
result = simplified_hash_join_with_memory_limit(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[28, 6, 8, 3, 15, 19]

Note that this would either require some indexed access to the original tables, or multiple passes over the data. It can also happen that some partitions still have too many rows to fit into memory after the first partitioning step, which is why we need the recursion. In the worst case, no hash function will work at all, because the table has too many rows with the same value for the join key.
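
To see this worst case in action, we can call the function with two degenerate (hypothetical) tables in which every row has the same join key value, so that no hash function can split them into smaller partitions:

try:
    # No partitioning scheme can separate identical join key values, so the
    # recursion limit is eventually exceeded.
    simplified_hash_join_with_memory_limit([1] * 20, [1] * 20)
except Exception as e:
    print(e)
Cannot compute JOIN due to memory constraints!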

The hash join also becomes more complicated if the join condition is not an equality but e.g. a range of values. For simple cases it may be possible to enumerate the values in the range and run a lookup for each of them. This quickly becomes infeasible though if the range includes many values, and would then essentially require a different index structure than a regular hash table to look up ranges. However, while order-preserving hash functions do exist, most databases tend to use the hash join only for equality joins.
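
As an illustration, here is a minimal sketch of the enumeration approach for a hypothetical band join, where a value a from x matches a value b from y if they differ by at most delta; the function name is made up, and the approach only stays feasible as long as delta is small:

def simplified_range_hash_join(x, y, delta):
    hash_table = {a: True for a in x}
    result = []
    for b in y:
        # Enumerate all candidate values in the range around b and run a
        # regular hash table lookup for each of them.
        for candidate in range(b - delta, b + delta + 1):
            if candidate in hash_table:
                result.append((candidate, b))
    return result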

For range joins, or if memory usage is a problem, the merge join can be a better alternative. This implementation relies on both tables being sorted by the join key. It can then iterate step by step through both tables at the same time, always advancing the iterator from the table that is lagging.

def simplified_merge_join(x, y):
    result = []

    # Sort both inputs, then advance through them in lockstep.
    xs, ys = iter(sorted(x)), iter(sorted(y))
    xv, yv = next(xs, None), next(ys, None)

    while xv is not None and yv is not None:
        if xv == yv:
            result.append(xv)
            xv, yv = next(xs, None), next(ys, None)
        elif xv < yv:
            # x is lagging behind, advance its iterator.
            xv = next(xs, None)
        elif yv < xv:
            # y is lagging behind, advance its iterator.
            yv = next(ys, None)

    return result

print('Input:'), print(x), print(y)
result = simplified_merge_join(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[3, 6, 8, 15, 19, 28]

Once the tables are sorted, the runtime is linear (as again in this example the join keys are unique) and the join does not need a lot of additional memory. Sorting both tables is usually the bottleneck of this join (more specifically: O(n log n) with n = max(|x|, |y|)), and can also cause problems with memory usage. While there are relatively efficient algorithms for sorting with limited memory[4], sorting is nevertheless sometimes seen as a separate step from the merge join itself. A more realistic implementation would in any case at least check if the tables are already sorted. If they are, or if some other index structure allows sequential access to the join key values, the merge join is often the best choice.
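
A minimal sketch of such a check could look as follows; the helper names are hypothetical, and the merge phase is the same as in simplified_merge_join above:

def is_sorted(values):
    # True if the values are already in ascending order.
    return all(a <= b for a, b in zip(values, values[1:]))

def merge_join_with_sort_check(x, y):
    # Skip the O(n log n) sort for inputs that are already ordered.
    xs = x if is_sorted(x) else sorted(x)
    ys = y if is_sorted(y) else sorted(y)

    # Merge phase on the sorted lists, as in simplified_merge_join.
    result, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            result.append(xs[i])
            i, j = i + 1, j + 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return result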

Summarizing the basic join algorithms:

- Nested loop join: the runtime is always O(|x| × |y|) and almost no additional memory is needed; a reasonable choice when one table is very small or when nearly all rows match anyway.
- Hash join: the runtime is O(|x| + |y|) for equality joins with well-distributed keys, but additional memory for the hash table is needed; partitioning can trade memory usage for extra passes over the data.
- Merge join: the runtime is O(n log n) with n = max(|x|, |y|) and dominated by sorting, linear if the inputs are already sorted; needs little additional memory and also works for range joins.

Some additional notes:


References: