Apr 2019 Rational Exuberance

Data layout in distributed column stores

It is arguably very convenient to query data using some high-level language without knowing where the data actually is or how it is stored. More and more systems are build which (partially) allow to do so, even across different data formats. If data can be queried directly as-is, there is a clear advantage of eliminating integration overhead.

However, a lot of collected data is analytical data, which is written only once but read much more often. For this type of data it can make sense to store it in a way that is optimized for querying. While this typically brings a certain overhead in deciding how to store the data and actually writing it, the more often it is read the more this overhead can pay off. Provided the query processing engine is optimized for such kind of storage[1].

Therefore analytical data is frequently stored in columnar formats. That basically means that not all attributes of a specific data point are stored together, but instead all values (of different data points) for a specific attribute are stored together. The reasoning is the following:

Additionally a lot of analytical data is also stored in a distributed way on different storage devices which are connected via some kind of network. Since transferring data over the network is in general even slower than reading from disk, this is something that should be avoided even more than just reading data.

Designing a data layout for relations in a distributed column store (that is optimized for querying the data) then basically means deciding between the various trade-offs of the following parameters:

Now while decisions for all of these parameters already have their internal pros and cons, combining them results in even more trade-offs. Some examples:

Finding the absolutely best combination between all of these contradicting parameters is generally quite difficult, and depends on the exact use case for the data anyway. A usually good enough heuristic is to simply choose each parameter step-by-step in order of their importance:

  1. First define how the data should be distributed. This should make sure that distribution is not skewed, and clarify which parts of the data will be on the same storage node.
  2. Then decide how data should be sorted on each node.
  3. Finally, choose column compression based on previously defined data distribution and sort order.
For the first two steps it is important to know what kind of queries will be run (most frequently) against the data. Once the data distribution and sort order is determined, the options for column compression are more or less determined as well though. The decision is then mostly if the focus should be on a more heavyweight compression to save disk space, or to use more lightweight compression that could also speed up queries.

Below is a simple interactive demo to see the effects of different data distribution, sorting and column compression types for a system with three different storage nodes. While the actual storage format and parameters in most real systems are usually more complicated, it can still give a good idea for how the storage is organized.

The text input has some sample data, for which different distributions, sort orders and column compressions can be tested. Note that every 8th row starts a new block, which disrupts compression schemes like Runlength and Delta to allow decompression without reading all the data. In practice these blocks should be large enough to have little impact on the actual compression ratio.

You can also enter your own example data here, but note that parsing is not very sophisticated. So your data should have a header column, and use a comma as delimiter without quote characters (i.e. a comma will always be interpreted as delimiter). The demo is also not particularly efficient, so do not enter more than maybe a few hundred rows.