Query plans and planning
A query plan is an intermediate representation of the code to actually run a query, and can be very helpful to optimize it.
A query plan is an intermediate representation of the code to actually run a query, and can be very helpful to optimize it.
Data layout is important for distributed data processing engines, because it is usually the first (and sometimes only) index for efficient data access.
K-means is a simple and popular way to cluster data points. It can easily be implemented in SQL, and can be parallelized well with almost no additional effort in a distributed database.
The join is probably the most important operator in SQL. Understanding its performance can often help a lot with understanding query performance in general.
From reading event log files, it is often not immediately obvious which events were happening in parallel. Gantt charts can be very helpful to visualize this information, but most libraries require to preprocess the data a bit to create dense charts.
Directed acyclic graphs are useful in many situations. Especially when displaying them, it can sometimes be useful to hide redundant edges. Fortunately there is a relatively easy way to find these edges. Even without SQL.
Boosted trees are as of now one of the more successful machine learning models. And since they are basically just a standard tree data structure, it's actually quite easy to convert them to a relational model and compute predictions in SQL.
Locality sensitive hashing is kinda the opposite of the more well-known cryptographic hash functions. It's aim is to hash similar items into the same buckets. While this would be quite a disaster for cryptographic applications, it allows for a nice approximation of nearest neighbors. That can even be implemented in SQL.
The HyperLogLog algorithm is a somewhat famous approximation for the distinct count aggregation that requires only very little memory. And while the accuracy analysis isn't necessarily all that obvious, the implementation is actually quite simple. Simple enough to be implemented directly in SQL.