Data layout in distributed column stores
It is arguably very convenient to query data using some high-level language without knowing where the data actually is or how it is stored. More and more systems are built which (at least partially) allow this, even across different data formats. If data can be queried directly as-is, there is the clear advantage of eliminating integration overhead.
However, a lot of collected data is analytical data, which is written only once but read much more often. For this type of data it can make sense to store it in a way that is optimized for querying. While this typically brings some overhead in deciding how to store the data and actually writing it, the more often the data is read, the more this overhead can pay off, provided the query processing engine is optimized for this kind of storage[1].
Therefore analytical data is frequently stored in columnar formats. This means that the attributes of a single data point are not stored together; instead, all values of one attribute (across data points) are stored together. The reasoning is the following:
- While each data point commonly has many attributes, individual analytical queries often only require a small subset of them. Storing each attribute separately makes it easy to avoid reading data that a query does not need, as illustrated in the sketch after this list.
- Many operations in analytical queries are batch operations over attributes, which can profit a lot from hardware support for SIMD instructions. Columnar storage supports this workload very well.
- Values of a single attribute often have patterns which allow very effective and/or lightweight compression schemes to be used for the column. Each attribute can easily use a different type of compression, according to the actual values or use cases of the column.
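To make the layout difference concrete, here is a minimal sketch in Python; the tiny example table and its attribute names are made up purely for illustration:

```python
# Row layout: each record keeps all of its attributes together.
rows = [
    {"user": "alice", "country": "DE", "clicks": 3},
    {"user": "bob",   "country": "DE", "clicks": 7},
    {"user": "carol", "country": "FR", "clicks": 2},
]

# Column layout: one contiguous list per attribute.
columns = {
    "user":    ["alice", "bob", "carol"],
    "country": ["DE", "DE", "FR"],
    "clicks":  [3, 7, 2],
}

# A query like SUM(clicks) needs exactly one attribute. In the column
# layout it scans a single list; in the row layout it has to walk every
# record and extract the field from each one.
assert sum(r["clicks"] for r in rows) == sum(columns["clicks"]) == 12
```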
Additionally, a lot of analytical data is stored in a distributed way, across different storage devices connected via some kind of network. Since transferring data over the network is in general even slower than reading from disk, network transfers should be avoided even more aggressively than disk reads.
Designing a data layout for relations in a distributed column store that is optimized for querying then basically means weighing the trade-offs of the following parameters:
- Data distribution, to specify which parts of the data will be stored on which storage node. Two things matter most here: data distribution should not be too skewed, i.e. ideally each node stores a similar amount of data; and ideally data is distributed such that frequent operations can happen mostly locally on each node and only need minimal network bandwidth (see the distribution sketch after this list).
- Sort order, to specify how data on each storage node is sorted. While sorting data has an obvious additional cost during inserts of new data, it is still a relatively cheap index that requires no additional storage. The most important question here is which columns to use for the sort order: often, different queries would benefit from completely different sort orders.
- Column compression, to specify how individual columns should be compressed. The main consideration here is the impact of the compression scheme on query run-time compared to the storage savings. While column compression sometimes both saves storage space and speeds up queries, it is often a trade-off between the two; in the worst case a compression scheme can even increase the required storage while making queries slower.
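As a minimal sketch of the distribution trade-off, here are two simple strategies for a hypothetical relation with a country column, assuming three storage nodes (real systems offer more strategies, such as range partitioning):

```python
import zlib
from collections import defaultdict

NUM_NODES = 3
rows = [("alice", "DE"), ("bob", "DE"), ("carol", "FR"),
        ("dave", "FR"), ("erin", "US"), ("frank", "DE")]

def hash_distribute(rows, key):
    """All rows with the same key value land on the same node, which
    keeps operations like GROUP BY on that key local, but can skew the
    distribution if one value dominates."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[zlib.crc32(key(row).encode()) % NUM_NODES].append(row)
    return dict(nodes)

def round_robin_distribute(rows):
    """Perfectly even distribution, but rows with equal values are
    scattered, so value-based operations need network transfers."""
    nodes = defaultdict(list)
    for i, row in enumerate(rows):
        nodes[i % NUM_NODES].append(row)
    return dict(nodes)

print(hash_distribute(rows, key=lambda r: r[1]))  # grouped by country
print(round_robin_distribute(rows))               # even, but scattered
```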
Now while the decision for each of these parameters already has its own pros and cons, combining them results in even more trade-offs. Some examples:
- If data distribution is chosen such that all data points with the same value for some column end up on the same node, it can be very helpful to use this column for the sort order as well. If data distribution instead spreads data points with the same value for the column across different nodes, it may be less valuable as a sort column.
- If the relation is sorted such that a certain column contains many repeating values, run-length encoding can be a very good compression scheme for that column, as the sketch after these examples shows. If the relation is instead sorted in a different order such that there are almost no repeating values, run-length encoding can even increase the storage size, since it now has to store a (value, run-length) pair for almost every row.
- If some column is compressed using delta encoding, it is important that most differences between consecutive values are rather small. This usually implies that the relation is somehow sorted by this column. However, even then the differences between values on each node may be larger than expected if the data distribution stores consecutive values on different nodes.
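These sorting effects are easy to reproduce in a small sketch. The encodings below are deliberately simplified; real formats add headers, bit-packing and the like:

```python
def rle_encode(values):
    """Run-length encoding: store one (value, run_length) pair per run."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def delta_encode(values):
    """Delta encoding: store the first value, then the differences."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

print(rle_encode(["DE", "DE", "DE", "FR", "FR", "US"]))  # 3 pairs
print(rle_encode(["DE", "FR", "DE", "US", "FR", "DE"]))  # 6 pairs: worse than raw
print(delta_encode([100, 101, 103, 107, 110]))           # [100, 1, 2, 4, 3]
```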
Finding the absolute best combination of all these conflicting parameters is generally quite difficult, and depends on the exact use case for the data anyway. A usually good enough heuristic is to simply choose each parameter step by step, in order of importance:
- First define how the data should be distributed. This should make sure that distribution is not skewed, and clarify which parts of the data will be on the same storage node.
- Then decide how data should be sorted on each node.
- Finally, choose column compression based on the previously defined data distribution and sort order, as sketched below.
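The last step can be sketched as a toy heuristic, reusing rle_encode and delta_encode from the earlier sketch. It inspects a column as it will actually be stored, i.e. after distribution and sorting; the thresholds are arbitrary illustrations, not what any real system would use:

```python
def choose_compression(column):
    """Pick a compression scheme for one column of one node."""
    if len(rle_encode(column)) <= len(column) // 2:
        return "runlength"               # long runs: RLE pays off
    if all(isinstance(v, int) for v in column):
        deltas = delta_encode(column)[1:]
        if deltas and max(map(abs, deltas)) < 256:
            return "delta"               # small diffs fit into one byte
    return "none"                        # compression might even hurt here

print(choose_compression(["DE", "DE", "DE", "FR", "FR", "US"]))  # runlength
print(choose_compression([100, 101, 103, 107, 110]))             # delta
print(choose_compression(["DE", "FR", "US", "IT", "JP", "UK"]))  # none
```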
Below is a simple interactive demo to see the effects of different data distributions, sort orders and column compression types for a system with three storage nodes. While the actual storage formats and parameters of most real systems are more complicated, it can still give a good idea of how the storage is organized.
The text input contains some sample data, for which different distributions, sort orders and column compressions can be tested. Note that every 8th row starts a new block, which interrupts compression schemes like Runlength and Delta so that decompression does not require reading all preceding data. In practice these blocks should be large enough to have little impact on the actual compression ratio; the sketch below illustrates the idea.
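A minimal sketch of such block boundaries, again reusing rle_encode from the earlier sketch and using block size 8 as in the demo:

```python
BLOCK_SIZE = 8  # matches the demo; real systems use much larger blocks

def rle_encode_blocks(values, block_size=BLOCK_SIZE):
    """Encode each block independently, so any block can be decoded
    without reading the ones before it."""
    return [rle_encode(values[i:i + block_size])
            for i in range(0, len(values), block_size)]

print(rle_encode_blocks(["DE"] * 12))
# [[['DE', 8]], [['DE', 4]]] -- one run of 12 is split at the block
# boundary, costing an extra pair but keeping blocks independent.
```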
You can also enter your own example data here, but note that parsing is not very sophisticated: your data needs a header row and must use a comma as delimiter without quote characters (i.e. a comma is always interpreted as a delimiter). The demo is also not particularly efficient, so do not enter more than maybe a few hundred rows.
References: