
A short overview of JOIN algorithms

In SQL, the JOIN[1] is arguably the most important operator for combining data from different tables. Logically there are different join types, e.g. LEFT/RIGHT, INNER/OUTER and CROSS JOINs, which specify the result of the operator exactly. Yet since SQL is a declarative language, they do not specify how the result is obtained.

There has been a lot of research on how to improve join implementations[2], and writing an implementation that performs competitively on modern hardware is far from trivial[3]. Many optimizations for specific subsets of the generic join operation exist as well. Most common implementations are variants of three basic algorithms though: the nested loop join, the hash join and the merge join. All of them can be used to implement any of the different join types.

Many current databases implement all three of these algorithms and let the query planner decide which one to use for each join in a given query. As a user, you ideally do not need to care which of them is used, since they all produce the same result. However, depending on the data in both tables and on the join keys, each of the algorithms can show very different performance patterns. Understanding the performance of the join algorithm chosen by the query planner is therefore often key to understanding the performance of the whole query.

We'll first look at the simplest algorithm, the nested loop join. And we'll further simplify by restricting the implementation to an inner join of in-memory integer arrays without duplicates, testing the join key for equality. Essentially this means that we look at a function which takes two arrays of integers as parameters and returns an array of the integers that occur in both input arrays. Except that we'll actually use Python lists instead of arrays.

import random

x = random.sample(range(40), 20)
y = random.sample(range(40), 10)

def simplified_nested_loop_join(x, y):
    result = []
    # Compare every value from x with every value from y.
    for a in x:
        for b in y:
            if a == b:
                result.append(a)
    return result

print('Input:'), print(x), print(y)
result = simplified_nested_loop_join(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[3, 28, 15, 19, 6, 8]

The implementation is not very complicated and doesn't need a lot of additional memory, but it has a huge disadvantage: the runtime is, independent of the data in the tables, always quadratic (more specifically: O(|x| × |y|)). Since in practice most joins can be computed much faster, the nested loop join is often not a good choice. Additionally, the input tables are not necessarily kept in memory as in the simple example here, which could then result in repeated (slow) disk accesses for the table in the inner loop.

Due to its simplicity, the nested loop join can nevertheless still be helpful in some situations. For example, if the join condition matches nearly all of the rows (i.e. the join is almost equivalent to a cross join), the runtime is effectively quadratic anyway and the nested loop join may have the least overhead of all implementations. Or if one of the tables is very small, a good implementation would choose the small table for the inner loop. On many architectures this can lead to very efficient machine code, where the small table is kept in a very fast CPU cache with few cache misses (implicitly emulating a hash join).
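
Since all three basic algorithms can implement any of the join types, here is a minimal sketch of how the nested loop join above could be extended to a simplified LEFT OUTER join; the function name is hypothetical, and None stands in for SQL's NULL:

def simplified_nested_loop_left_outer_join(x, y):
    result = []
    for a in x:
        matched = False
        for b in y:
            if a == b:
                result.append((a, b))
                matched = True
        # LEFT OUTER join: keep values from x without a match in y,
        # padded with None (the equivalent of NULL in SQL).
        if not matched:
            result.append((a, None))
    return result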

If two relatively large tables with only a few matches per row are joined though, the hash join may be a better option. As indicated by the name, it first creates a hash table as an index structure for lookups in one of the tables. It then iterates over the other table and uses the hash table to determine for each row if the join condition matches.

def simplified_hash_join(x, y):
    result = []

    # Build phase: index one of the tables with a hash table.
    hash_table = {}
    for a in x:
        hash_table[a] = True

    # Probe phase: look up every value of the other table.
    for b in y:
        if b in hash_table:
            result.append(b)

    return result

print('Input:'), print(x), print(y)
result = simplified_hash_join(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[6, 3, 28, 15, 8, 19]

If, as in the example here, the join keys are unique and the hash table avoids too many collisions, the runtime of the hash join is linear (more specifically: O(|x| + |y|)), which is quite an improvement compared to the nested loop join. This comes at a certain cost however: memory usage increases, since the additional hash table is needed. An actual implementation would likely try to reduce memory usage by checking which of the input tables is smaller, and then build the hash table for that smaller table (where smaller may mean either the fewest unique join key values or the fewest rows).

But even then the size of the smaller table can still exceed the available memory. A slightly more complicated implementation, as in the example here, could therefore reduce memory usage further by partitioning both tables with the same hash function before the join, and then running the hash join independently for each partition. Finally, the results from all partitions are concatenated to obtain the actual result of the join.

MEMORY_LIMIT, MAX_RECURSION_LIMIT = 8, 3

def simplified_hash_join_with_memory_limit(x, y, level=1):
    if level > MAX_RECURSION_LIMIT:
        raise Exception('Cannot compute JOIN due to memory constraints!')

    smaller_table = x if len(x) <= len(y) else y
    larger_table = x if len(x) > len(y) else y

    # If even the smaller table does not fit into memory, partition both
    # tables with the same hash function and join each partition separately.
    if len(smaller_table) > MEMORY_LIMIT:
        results = []
        partitions = len(smaller_table) // MEMORY_LIMIT + 1
        for partition in range(partitions):
            # The factor `level` varies the hash function on each recursion
            # level, so repartitioning has a chance to split the data.
            results.extend(simplified_hash_join_with_memory_limit(
                list(filter(lambda a: (level * a) % partitions == partition, x)),
                list(filter(lambda b: (level * b) % partitions == partition, y)),
                level=level+1
            ))
        return results

    # Otherwise run a regular hash join, building the hash table
    # for the smaller table.
    result = []
    hash_table = {a: True for a in smaller_table}
    for b in larger_table:
        if b in hash_table:
            result.append(b)
    return result

print('Input:'), print(x), print(y)
result = simplified_hash_join_with_memory_limit(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[28, 6, 8, 3, 15, 19]

Note that this would either require some indexed access to the original tables, or multiple passes over the data. It can also happen that some partitions still have too many rows to fit into memory after the first partitioning step, which is why we need the recursion. In the worst case, no hash function will work at all, because the table has too many rows with the same value for the join key.
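
To see this worst case in action, we can call the function with two degenerate (hypothetical) tables in which every row has the same join key value, so that no hash function can split them into smaller partitions:

try:
    # No partitioning scheme can separate identical join key values, so the
    # recursion limit is eventually exceeded.
    simplified_hash_join_with_memory_limit([1] * 20, [1] * 20)
except Exception as e:
    print(e)
Cannot compute JOIN due to memory constraints!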

The hash join also becomes more complicated if the join condition is not an equality but e.g. a range of values. For simple cases it may be possible to enumerate the values in the range and run a lookup for each of them. This quickly becomes infeasible though if the range includes many values, and would then essentially require a different index structure than a regular hash table to look up ranges. However, while order-preserving hash functions do exist, most databases tend to use the hash join only for equality joins.
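
As an illustration, here is a minimal sketch of the enumeration approach for a hypothetical band join, where a value a from x matches a value b from y if they differ by at most delta; the function name is made up, and the approach only stays feasible as long as delta is small:

def simplified_range_hash_join(x, y, delta):
    hash_table = {a: True for a in x}
    result = []
    for b in y:
        # Enumerate all candidate values in the range around b and run a
        # regular hash table lookup for each of them.
        for candidate in range(b - delta, b + delta + 1):
            if candidate in hash_table:
                result.append((candidate, b))
    return result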

For range joins, or if memory usage is a problem, the merge join can be a better alternative. This implementation relies on both tables being sorted by the join key. It can then iterate step by step through both tables at the same time, always advancing the iterator from the table that is lagging.

def simplified_merge_join(x, y):
    result = []

    # Sort both inputs, then advance through them in lockstep.
    xs, ys = iter(sorted(x)), iter(sorted(y))
    xv, yv = next(xs, None), next(ys, None)

    while xv is not None and yv is not None:
        if xv == yv:
            result.append(xv)
            xv, yv = next(xs, None), next(ys, None)
        elif xv < yv:
            # x is lagging behind, advance its iterator.
            xv = next(xs, None)
        elif yv < xv:
            # y is lagging behind, advance its iterator.
            yv = next(ys, None)

    return result

print('Input:'), print(x), print(y)
result = simplified_merge_join(x, y)
print('Result:'), print(result)
Input:
[25, 3, 14, 12, 1, 31, 36, 28, 27, 5, 4, 10, 15, 18, 9, 19, 6, 8, 20, 29]
[7, 6, 3, 34, 28, 2, 15, 17, 8, 19]
Result:
[3, 6, 8, 15, 19, 28]

Once the tables are sorted, the runtime is linear (as again in this example the join keys are unique) and the join does not need a lot of additional memory. Sorting both tables is usually the bottleneck of this join (more specifically: O(n log n) with n = max(|x|, |y|)), and can also cause problems with memory usage. While there are relatively efficient algorithms for sorting with limited memory[4], sorting is nevertheless sometimes seen as a separate step from the merge join itself. A more realistic implementation would in any case at least check if the tables are already sorted. If they are, or if some other index structure allows sequential access to the join key values, the merge join is often the best choice.
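
A minimal sketch of such a check could look as follows; the helper names are hypothetical, and the merge phase is the same as in simplified_merge_join above:

def is_sorted(values):
    # True if the values are already in ascending order.
    return all(a <= b for a, b in zip(values, values[1:]))

def merge_join_with_sort_check(x, y):
    # Skip the O(n log n) sort for inputs that are already ordered.
    xs = x if is_sorted(x) else sorted(x)
    ys = y if is_sorted(y) else sorted(y)

    # Merge phase on the sorted lists, as in simplified_merge_join.
    result, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            result.append(xs[i])
            i, j = i + 1, j + 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return result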

Summarizing the basic join algorithms:

- Nested loop join: the runtime is always O(|x| × |y|) and almost no additional memory is needed; a reasonable choice when one table is very small or when nearly all rows match anyway.
- Hash join: the runtime is O(|x| + |y|) for equality joins with well-distributed keys, but additional memory for the hash table is needed; partitioning can trade memory usage for extra passes over the data.
- Merge join: the runtime is O(n log n) with n = max(|x|, |y|) and dominated by sorting, linear if the inputs are already sorted; needs little additional memory and also works for range joins.

Some additional notes:


References: