Rational Exuberance

Jun 2019

Query plans and planning

A query plan is an intermediate representation of the code that actually runs a query, and reading it can be very helpful when optimizing that query.

Apr 2019

Data layout in distributed column stores

Data layout is important for distributed data processing engines, because it is usually the first (and sometimes only) index for efficient data access.

Mar 2019

Distributed k-means clustering in SQL

K-means is a simple and popular way to cluster data points. It can easily be implemented in SQL, and can be parallelized well with almost no additional effort in a distributed database.

Feb 2019

A short overview of JOIN algorithms

The join is probably the most important operator in SQL. Understanding its performance can often help a lot with understanding query performance in general.

Dec 2018

Visualizing timeline logs

When reading event log files, it is often not immediately obvious which events were happening in parallel. Gantt charts can be very helpful to visualize this information, but most libraries require preprocessing the data a bit to create dense charts.

Nov 2018

An easy way to find redundant edges in a DAG

Directed acyclic graphs are useful in many situations. Especially when displaying them, it can sometimes be useful to hide redundant edges. Fortunately there is a relatively easy way to find these edges. Even without SQL.

Oct 2018

Boosted tree predictions in SQL

Boosted trees are currently among the more successful machine learning models. And since they are basically just a standard tree data structure, it's actually quite easy to convert them to a relational model and compute predictions in SQL.

Sep 2018

KNN approximation with LSH in SQL

Locality sensitive hashing is kinda the opposite of the more well-known cryptographic hash functions. Its aim is to hash similar items into the same buckets. While this would be quite a disaster for cryptographic applications, it allows for a nice approximation of nearest neighbors. That can even be implemented in SQL.

Mar 2018

HyperLogLog in SQL

The HyperLogLog algorithm is a somewhat famous approximation for the distinct count aggregation that requires very little memory. And while the accuracy analysis isn't necessarily all that obvious, the implementation is actually quite simple. Simple enough to be implemented directly in SQL.