Lightwood

Lightwood is the default AI engine used in MindsDB. It deals mainly with classification, regression, and time-series problems in machine learning. By providing it with the input data and problem definition, Lightwood generates predictions following three core steps that include Data pre-processing and cleaning, Feature engineering, and Model building and training. The input data ranges from numbers, dates, categories, text, quantities, arrays, matrices, up to images, audios, and videos (passed as URLs).

We recommend using the mindsdb/mindsdb:lightwood Docker image that comes with the Lightwood dependencies pre-installed. Learn more here.

How It Works

Here is the algorithm followed by Lightwood starting from the input data setup, through model building and training, up to getting predictions.

The input data is pre-processed and each column is assigned a data type. Next, data is converted into features via encoders that transform data into numerical representation used by the model. Finally, a predictive model takes the encoded feature data and outputs a prediction for the target. Under the hood, the model splits data into the training, validation, and testing sets, with ratios that are dynamic but usually an 80-10-10 ratio. The split is done by default using random sampling without replacement, stratified on the target column. Doing so, it determines the accuracy of the model by evaluating on the held out test set. Users can either use Lightwood’s default mixers/models or create their own approaches inherited from the BaseMixer class. To learn more about Lightwood philosophy, follow this link.

Accuracy Metrics

Lightwood provides ways to score the accuracy of the model using one of the accuracy functions. The accuracy functions include mean_absolute_error, mean_squared_error, precision_score, recall_score, and f1_score.

CREATE MODEL model_name
FROM data_source
  (SELECT * FROM table_name)
PREDICT target_column
USING
  accuracy_functions="['accuracy_function']";

You can define the accuracy function of choice in the USING clause of the CREATE MODEL statement.

Here are the accuracy functions used by default:

the r2_score value for regression predictions.
the balanced_accuracy_score value for classification predictions.
the complementary_smape_array_accuracy value for time series predictions.

The values vary between 0 and 1, where 1 indicates a perfect predictor, based on results obtained for a held-out portion of data (i.e. testing set).You can check accuracy values for models using the DESCRIBE statement.

Tuning the Lightwood ML Engine

Description

In MindsDB, the underlying AutoML models are based on the Lightwood engine by default. This library generates models automatically based on the data and declarative problem definition. But the default configuration can be overridden using the USING statement that provides an option to configure specific parameters of the training process. In the upcoming version of MindsDB, it will be possible to choose from more ML frameworks. Please note that the Lightwood engine is used by default.

Syntax

Here is the syntax:

CREATE MODEL project_name.model_name
FROM data_source
    (SELECT column_name, ... FROM table_name)
PREDICT target_column
USING parameter_key = 'parameter_value';

`encoders` Key

It grants access to configure how each column is encoded. By default, the AutoML engine tries to get the best match for the data.

...
USING encoders.column_name.module = 'value';

To learn more about encoders and their options, visit the Lightwood documentation page on encoders.

`model.args` Key

It allows you to specify the type of machine learning algorithm to learn from the encoder data.

...
USING model.args = {"key": value};

Here are the model options:

Model	Description
BaseMixer	It is a base class for all mixers.
LightGBM	This mixer configures and uses LightGBM for regression or classification tasks depending on the problem definition.
LightGBMArray	This mixer consists of several LightGBM mixers in regression mode aimed at time series forecasting tasks.
NHitsMixer	This mixer is a wrapper around an MQN-HITS deep learning model.
Neural	This mixer trains a fully connected dense network from concatenated encoded outputs of each feature in the dataset to predict the encoded output.
NeuralTs	This mixer inherits from Neural mixer and should be used for time series forecasts.
ProphetMixer	This mixer is a wrapper around the popular time series library Prophet.
RandomForest	This mixer supports both regression and classification tasks. It inherits from sklearn.ensemble.RandomForestRegressor and sklearn.ensemble.RandomForestClassifier.
Regression	This mixer inherits from scikit-learn’s Ridge class.
SkTime	This mixer is a wrapper around the popular time series library sktime.
Unit	This is a special mixer that passes along whatever prediction is made by the target encoder without modifications. It is used for single-column predictive scenarios that may involve complex and/or expensive encoders (e.g. free-form text classification with transformers).
XGBoostMixer	This mixer is a good all-rounder, due to the generally great performance of tree-based ML algorithms for supervised learning tasks with tabular data.

Please note that not all mixers are available in our cloud environment. In particular, LightGBM, LightGBMArray, NHITS, and Prophet.

To learn more about all the model options, visit the Lightwood documentation page on mixers.

`problem_definition.embedding_only` Key

To train an embedding-only model, use the below parameter when creating the model.

CREATE MODEL embedding_only_model
...
USING problem_definition.embedding_only = True;

The predictions made by this embedding model come in the form of embeddings by default. Alternatively, to get predictions in the form of embeddings from a normal model (that is, trained without specifying the problem_definition.embedding_only parameter), use the below parameter when querying this model for predictions.

SELECT predictions
...
USING return_embedding = True;

Other Keys Supported by Lightwood in JsonAI

The most common use cases of configuring predictors use encoders and model keys explained above. To see all the available keys, check out the Lightwood documentation page on JsonAI.

Example

Here we use the home_rentals dataset and specify particular encoders for some columns and a LightGBM model.

CREATE MODEL mindsdb.home_rentals_model
FROM example_db
    (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price
USING
    encoders.location.module = 'CategoricalAutoEncoder',
    encoders.rental_price.module = 'NumericEncoder',
    encoders.rental_price.args.positive_domain = 'True',
    model.args = {"submodels": [
                    {"module": "LightGBM",
                     "args": {
                          "stop_after": 12,
                          "fit_on_dev": true
                          }
                    }
                ]};

Explainability

With Lightwood, you can deploy the following types of models:

regressions models,
classification models,
time-series models,
embedding models.

Predictions made by each type of model come with an explanation column, as below.

Regression

In the case of regression models, the target_explain column contains the following information:

{"predicted_value": 2951, "confidence": 0.99, "anomaly": null, "truth": null, "confidence_lower_bound": 2795, "confidence_upper_bound": 3107}

The upper and lower bounds are determined via conformal prediction, and correspond to the reported confidence score (which can be modified by the user).

Try it out following this tutorial.

Classification

In the case of classification models, the target_explain column contains the following information:

{"predicted_value": "No", "confidence": 0.6629213483146067, "anomaly": null, "truth": null, "probability_class_No": 0.8561, "probability_class_Yes": 0.1439}

The confidence score is produced by the conformal prediction module and is well-calibrated. On the other hand, the probability_class comes directly from the model logits, which may be uncalibrated. Therefore, the probability_class score may be optimistic or pessimistic, i.e. coverage is not guaranteed to empirically match the reported score.

Try it out following this tutorial.

Time-Series

In the case of time-series models, the target_explain column contains the following information:

{"predicted_value": 501618.3125, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 500926.109375, "confidence_upper_bound": 502310.515625}

Try it out following this tutorial.

Embeddings

In the case of embeddings models, the target_explain column contains the following information:

{"predicted_value": [1.0, 6.712956428527832, 1.247057318687439, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.3025851249694824, 0.5629426836967468, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7540000081062317, 0.3333333432674408, 0.35483869910240173, 0.9583333134651184, 0.7833333611488342, 0.25], "confidence": null, "anomaly": null, "truth": null}

Try it out following this tutorial.

You can visit the comprehensive Lightwood docs here.

Check out the Lightwood tutorials here.

Overview

Data Sources

Connection

Data Catalog

Bring Your Own Models

How It Works

Accuracy Metrics

Tuning the Lightwood ML Engine

Description

Syntax

`encoders` Key

`model.args` Key

`problem_definition.embedding_only` Key

Other Keys Supported by Lightwood in JsonAI

Example

Explainability

Overview

Data Sources

Connection

Data Catalog

Bring Your Own Models

​How It Works

​Accuracy Metrics

​Tuning the Lightwood ML Engine

​Description

​Syntax

​encoders Key

​model.args Key

​problem_definition.embedding_only Key

​Other Keys Supported by Lightwood in JsonAI

​Example

​Explainability

How It Works

Accuracy Metrics

Tuning the Lightwood ML Engine

Description

Syntax

`encoders` Key

`model.args` Key

`problem_definition.embedding_only` Key

Other Keys Supported by Lightwood in JsonAI

Example

Explainability