Oct 2018 Rational Exuberance

Boosted tree predictions in SQL

Computing predictions with a boosted tree model is usually not much more than a standard tree traversal. More often than not, the actual challenge is getting the data needed for the decision at each node. This typically means exporting features from a database and loading them into the tree library, which can take more time than computing the predictions themselves. Pity.

Generally this is a pretty well-known problem with a standard solution: move the computation to where the data is. That can be done quite easily in this case, even if it means computing boosted tree predictions in SQL.

Let's take a look at the demo model from the XGBoost tutorial[1], which has the following representation:

booster[0]:
0:[f29<-9.53674316e-07] yes=1,no=2,missing=1
	1:[f56<-9.53674316e-07] yes=3,no=4,missing=3
		3:leaf=1.71217716
		4:leaf=-1.70044053
	2:[f109<-9.53674316e-07] yes=5,no=6,missing=5
		5:leaf=-1.94070864
		6:leaf=1.85964918
booster[1]:
0:[f60<-9.53674316e-07] yes=1,no=2,missing=1
	1:[f29<-9.53674316e-07] yes=3,no=4,missing=3
		3:leaf=0.78471756
		4:leaf=-0.968530357
	2:leaf=-6.23624468

The model consists of two booster trees. Each tree has a set of regular nodes, including the root node, as well as a set of leaf nodes. Each regular node has an associated feature and a split value, plus three links to possible next nodes (yes, no, missing). Each leaf node has only one attribute: the prediction. We can represent this with the following table, where some attributes will be NULL depending on the node type:

CREATE TABLE tree_model (
  model_name    VARCHAR NOT NULL,
  booster_id    INTEGER NOT NULL,
  node_id       INTEGER NOT NULL,
  feature_index INTEGER,
  split         DOUBLE PRECISION,
  node_yes      INTEGER,
  node_no       INTEGER,
  node_missing  INTEGER,
  prediction    DOUBLE PRECISION
);
And the demo model from above translates to these INSERT statements:
INSERT INTO tree_model VALUES
  ('demo_model.xgb', 0, 0, 29, -9.53674316e-07, 1, 2, 1, NULL),
  ('demo_model.xgb', 0, 1, 56, -9.53674316e-07, 3, 4, 3, NULL),
  ('demo_model.xgb', 0, 2, 109, -9.53674316e-07, 5, 6, 5, NULL),
  ('demo_model.xgb', 0, 3, NULL, NULL, NULL, NULL, NULL, 1.71217716),
  ('demo_model.xgb', 0, 4, NULL, NULL, NULL, NULL, NULL, -1.70044053),
  ('demo_model.xgb', 0, 5, NULL, NULL, NULL, NULL, NULL, -1.94070864),
  ('demo_model.xgb', 0, 6, NULL, NULL, NULL, NULL, NULL, 1.85964918),

  ('demo_model.xgb', 1, 0, 60, -9.53674316e-07, 1, 2, 1, NULL),
  ('demo_model.xgb', 1, 1, 29, -9.53674316e-07, 3, 4, 3, NULL),
  ('demo_model.xgb', 1, 2, NULL, NULL, NULL, NULL, NULL, -6.23624468),
  ('demo_model.xgb', 1, 3, NULL, NULL, NULL, NULL, NULL, 0.78471756),
  ('demo_model.xgb', 1, 4, NULL, NULL, NULL, NULL, NULL, -0.968530357);
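
Since the recursive query below looks up exactly one node per tree and level by (model_name, booster_id, node_id), it is a natural fit to make that combination the primary key. This is not part of the original setup, just a small optional sketch:

ALTER TABLE tree_model
  -- One row per node; also keeps the per-level node lookups in the
  -- recursive query below cheap.
  ADD PRIMARY KEY (model_name, booster_id, node_id);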

Now let's assume the feature data for which we want to compute predictions is available in the following database table:

CREATE TABLE features (
  sample_id      INTEGER NOT NULL,
  feature_values INTEGER[126] NOT NULL -- We only need integers for the demo.
);
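
To have something concrete to run against, here are two made-up samples (not from the original post): one with every feature present and set to zero, and one with every feature missing:

INSERT INTO features VALUES
  (1, array_fill(0, ARRAY[126])),             -- all 126 features present, all zero
  (2, array_fill(NULL::INTEGER, ARRAY[126])); -- all features missing

With the demo model above, sample 1 never satisfies the < split test (0 is larger than the tiny negative split values) and therefore always follows the no branches, while sample 2 always follows the missing branches.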

We can join the feature table to the model table, and start computing predictions:

WITH RECURSIVE tree_eval AS (
  -- The non-recursive term first selects all the root nodes.
  SELECT
    -- Keep sample_id, model_name and booster_id as the identifying columns.
    f.sample_id,
    m.model_name,
    m.booster_id,
    CASE
      -- Choose the next node based on the split value of the root node.
      WHEN f.feature_values[m.feature_index] <  m.split THEN m.node_yes
      WHEN f.feature_values[m.feature_index] >= m.split THEN m.node_no
      WHEN f.feature_values[m.feature_index] IS NULL THEN m.node_missing
    END AS node_id,
    m.prediction -- In case the root node is already a leaf...
  FROM tree_model m
  CROSS JOIN features f -- Compute predictions for all samples.
  WHERE m.model_name = 'demo_model.xgb' -- But only for one specific model.
    AND m.node_id = 0 -- Start from the root nodes.

  UNION ALL -- We won't have duplicates; each level visits one node per tree.

  -- The recursive term now follows the tree down by one level.
  SELECT
    f.sample_id,
    p.model_name,
    p.booster_id,
    CASE
      -- Again choose the next node based on the split value.
      WHEN f.feature_values[m.feature_index] <  m.split THEN m.node_yes
      WHEN f.feature_values[m.feature_index] >= m.split THEN m.node_no
      WHEN f.feature_values[m.feature_index] IS NULL THEN m.node_missing
    END AS node_id,
    m.prediction -- Will only be set for leaf nodes.
  FROM tree_eval p
  -- Once a leaf node has been processed its next node_id is NULL, so this
  -- join finds no rows on the following level and stops the recursion.
  JOIN tree_model m USING (model_name, booster_id, node_id)
  JOIN features f USING (sample_id)
)
-- Finally, combine the leaf predictions from the individual trees into the
-- overall result (here via the sigmoid, matching the model's logistic objective).
SELECT
  sample_id,
  (1 / (1 + exp(-sum(prediction)))) AS prediction
FROM tree_eval
WHERE prediction IS NOT NULL -- Only leaf nodes have prediction values.
GROUP BY sample_id
ORDER BY sample_id;
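
The sigmoid in that final SELECT corresponds to a logistic objective (binary:logistic in the demo) and assumes the default base_score of 0.5, whose logit is zero. For a model trained with a plain regression objective such as reg:squarederror, the leaf values would instead be summed directly and offset by base_score. A sketch of that variant, meant to replace only the final SELECT (it still relies on the tree_eval CTE above):

SELECT
  sample_id,
  -- base_score (0.5 by default) plus the summed leaf values.
  0.5 + sum(prediction) AS prediction
FROM tree_eval
WHERE prediction IS NOT NULL
GROUP BY sample_id
ORDER BY sample_id;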
And that's it. We computed predictions from a boosted tree model directly in SQL. Demo code for translating GBM models to SQL as shown above can be found on GitHub[2].

Some additional notes:


References: