
Query plans and planning

Running an SQL query usually requires many of the same steps an interpreter[1] performs when running a program. A database needs to parse the query, generate code for it, and eventually run the code. Since directly generating efficient code for a query is difficult, many databases first generate a query plan as an intermediate representation of the code. Such a query plan can then either be run directly via an integrated runtime system (often a version of the iterator model), or be further compiled to machine code[2][3].
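As a rough illustration of the iterator model (a minimal sketch, not any particular database's implementation), each operator pulls rows from its child one at a time:

def scan(rows):
    # Leaf operator: produce the table's rows one by one.
    for row in rows:
        yield row

def filter_op(child, predicate):
    # Pull rows from the child and only pass on matching ones.
    for row in child:
        if predicate(row):
            yield row

# A tiny plan: a Filter on top of a Scan. Running the query means
# pulling rows from the root operator.
plan = filter_op(scan([(1, 5.0), (2, -3.0)]), lambda row: row[1] > 0)
print(list(plan))  # [(1, 5.0)]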

As an example, let's take a look at the following query:

SELECT
  c.category_name,
  sum(t.value) AS total_value
FROM transactions t
JOIN categories c USING (category_id)
WHERE t.value > 0
GROUP BY c.category_name
ORDER BY total_value DESC

A Python implementation of the same query could look like the following:

import csv
from collections import defaultdict

result = defaultdict(float)

with open('categories') as fc, open('transactions') as ft:
    # Build a lookup table: category_id -> category_name
    category_reader = csv.reader(fc)
    categories = {row[0]: row[1] for row in category_reader}

    # Filter, join and aggregate in a single pass over the transactions
    transaction_reader = csv.reader(ft)
    for _, category_id, value in transaction_reader:
        if float(value) > 0:
            result[categories[category_id]] += float(value)

# ORDER BY total_value DESC
result = sorted(result.items(), key=lambda x: x[1], reverse=True)

This implementation should be reasonably fast and memory efficient.

However, even for this rather simple example it is not necessarily obvious how to generate the above code automatically from the query.

This is especially true considering the convenient optimization in the line result[categories[category_id]] += float(value) within the loop: it performs both the join and the aggregation in one step, which means the result of the join is never fully materialized. Running the query therefore requires only very little memory.

Instead of directly generating the code, most databases first generate a query plan consisting of a set of standard building blocks. These building blocks include operators for reading and sorting tables, building hash tables, or specific join implementations. Two possible query plans for the example query are sketched below.
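A rough text rendering of the two plans (illustrative only; real databases render plans differently, e.g. as EXPLAIN output or diagrams):

Plan 1 (filter pushed down, hash join):

Sort (total_value DESC)
  Aggregate (GROUP BY c.category_name)
    Hash Join (USING category_id)
      Filter (t.value > 0)
        Scan transactions t
      Hash
        Scan categories c

Plan 2 (filter after the join, sort-merge join):

Sort (total_value DESC)
  Aggregate (GROUP BY c.category_name)
    Filter (t.value > 0)
      Merge Join (USING category_id)
        Sort (category_id)
          Scan transactions t
        Sort (category_id)
          Scan categories c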

While the plans are not identical, both still follow more or less directly from the abstract syntax tree of the query, which makes them easier to generate than a direct translation to code. There are two key differences between the plans: where the transactions are filtered, and how the join is implemented.

Figuring out which of the generally many possible query plans is the best (or at least a reasonably good) one for a given query is often crucial for performance. Different plans can be generated either directly from the query or by modifying existing plans. The query planner then needs to decide which of these plans should actually be used.

For the result of the example query it does not matter whether transactions are filtered before the join, as in the first plan, or after it, as in the second plan. From a performance perspective, however, filtering before the join is very likely better: the filter may reduce the number of rows considerably, and thereby also reduce the cost of the subsequent sort and/or join operation. Such a predicate pushdown within the plan, i.e. applying filters as early as possible, is often one of the easiest yet most effective techniques for automated query optimization.
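As a sketch of how such a rewrite can be automated, consider a toy plan representation (the node classes and names below are made up for illustration, not taken from any real database):

from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    tables: set  # tables the predicate refers to
    child: object

@dataclass
class Join:
    left: object
    right: object

def tables_of(node):
    # The set of tables a subtree reads from.
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables_of(node.child)
    return tables_of(node.left) | tables_of(node.right)

def push_down(node):
    # Push a filter below a join when its predicate only touches one side.
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        if node.tables <= tables_of(join.left):
            return Join(Filter(node.predicate, node.tables, join.left), join.right)
        if node.tables <= tables_of(join.right):
            return Join(join.left, Filter(node.predicate, node.tables, join.right))
    return node

plan = Filter('t.value > 0', {'transactions'},
              Join(Scan('transactions'), Scan('categories')))
print(push_down(plan))
# -> Join(left=Filter(..., child=Scan(table='transactions')), right=Scan(table='categories'))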

The case for the join operation is a bit more complicated. Knowing (via up-to-date table statistics) that the categories table is small while the transactions table is large and not sorted, the query planner can decide that allocating memory for a hash table over the categories is probably cheaper than sorting the transactions table. Such optimizations can be automated by encoding that both join implementations are equivalent and attaching a cost heuristic to each.
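A crude version of such a cost heuristic could look like this (the formulas are simplified illustrations, not what real planners use):

import math

def hash_join_cost(n_build, n_probe):
    # Build a hash table over the smaller input, then probe it once
    # per row of the larger input: roughly linear in both sizes.
    return n_build + n_probe

def merge_join_cost(n_left, n_right):
    # Sort both inputs (n log n), then merge them in a single pass.
    return (n_left * math.log2(max(n_left, 2))
            + n_right * math.log2(max(n_right, 2))
            + n_left + n_right)

# With up-to-date statistics: few categories, many transactions.
print(hash_join_cost(100, 1_000_000))   # ~1e6
print(merge_join_cost(100, 1_000_000))  # ~2.1e7 -> the planner picks the hash join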

Now the first plan is actually very similar to the manually written Python code, except that the join and the aggregation are two separate operators. This doesn't mean the join result needs to be materialized though: the planner can decide to pipeline the join and the aggregation steps, which can mean either that intermediate rows from the join are immediately forwarded to the aggregation operator, or that machine code is generated for the combination of both operators.
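In Python, such pipelining can be sketched with generators (an illustration of the idea, not how databases actually implement it; the sample data is made up):

from collections import defaultdict

def filter_positive(rows):
    for row in rows:
        if float(row[2]) > 0:
            yield row

def hash_join(rows, categories):
    # categories: category_id -> category_name, built before probing
    for _, category_id, value in rows:
        yield categories[category_id], float(value)

def aggregate(rows):
    totals = defaultdict(float)
    for name, value in rows:
        totals[name] += value
    return totals

categories = {'1': 'food', '2': 'rent'}
transactions = [('t1', '1', '12.5'), ('t2', '2', '-3'), ('t3', '1', '4')]

# Each operator pulls rows from the previous one, so no intermediate
# result is materialized between filter, join and aggregation.
print(aggregate(hash_join(filter_positive(transactions), categories)))
# defaultdict(<class 'float'>, {'food': 16.5})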

This cannot easily be done for arbitrary combinations of operators though, since, for example, the hash join is a pipeline breaker on its build side: in the example it cannot yield any results before the categories have been fully read and hashed. There is therefore little reason to pipeline the operator that builds the category hash table with the hash join itself. Automating such optimizations can be simplified by encoding which operators are pipeline breakers.
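One (purely illustrative) way to encode this is a flag per operator input:

# Each operator declares which of its inputs must be consumed
# completely before any output can be produced.
PIPELINE_BREAKERS = {
    'HashJoin':  {'build': True, 'probe': False},
    'Sort':      {'input': True},
    'Aggregate': {'input': True},   # final totals need all input rows
    'Filter':    {'input': False},
}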

Generating optimized machine code for pipelined operators still isn't trivial, but it reduces the scope of code generation considerably compared to translating the whole query directly. Running machine code instead of an iterator-model runtime can remove overhead like frequent function calls and allow more hardware-focused code optimizations, and can therefore lead to significant performance improvements. Either way, the query plan is very helpful as an intermediate representation, both for optimizing the actual operator steps and their implementation.


References: