Anomaly Detection Handler
The Anomaly Detection handler implements supervised, semi-supervised, and unsupervised anomaly detection algorithms using the pyod, catboost, xgboost, and sklearn libraries. The models were chosen based on the results in the ADBench benchmark paper.
Additional information
- If there is no labelled data, we use an unsupervised learner with the syntax CREATE ANOMALY DETECTION MODEL <model_name>, without specifying the target to predict. MindsDB then adds a column called outlier when generating results.
- If we have labelled data, we use the regular model creation syntax. Backend logic chooses between a semi-supervised algorithm (currently XGBOD) and a supervised algorithm (currently CatBoost).
- If multiple models are provided, we create an ensemble and use majority voting.
- See the anomaly detection proposal document for more information.
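The majority-voting step for an ensemble can be sketched as follows. This is a minimal illustration, not the handler's actual code: the function names, the 0/1 flag encoding, and the rule that ties count as inliers are all assumptions made for the example.

```python
def majority_vote(flags):
    """Combine one row's outlier flags (1 = outlier, 0 = inlier), one per model.

    A row is an outlier only if strictly more than half the models say so.
    Treating ties as inliers is an assumption of this sketch.
    """
    return 1 if sum(flags) * 2 > len(flags) else 0


def ensemble_predict(per_model_flags):
    """per_model_flags: one list of 0/1 flags per model, all the same length."""
    # zip(*...) regroups the flags so each tuple holds every model's vote for one row.
    return [majority_vote(row) for row in zip(*per_model_flags)]


# Hypothetical predictions from three models over four rows.
flags = ensemble_predict([
    [0, 1, 1, 0],  # model A
    [0, 1, 0, 0],  # model B
    [1, 1, 0, 0],  # model C
])
print(flags)
```

Only the second row is flagged, because it is the only one where at least two of the three models agree on an outlier.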
Context about types of anomaly detection
- Supervised: we have inlier/outlier labels, so we can train a classifier the normal way. This is very similar to a standard classification problem.
- Semi-supervised: we have inlier/outlier labels and perform an unsupervised preprocessing step followed by a supervised classification algorithm.
- Unsupervised: we don’t have inlier/outlier labels and cannot assume all training data are inliers. These methods construct inlier criteria, based on distributional traits, that will classify some training data as outliers too. New observations are classified against these criteria. However, it’s not possible to evaluate how well the model detects outliers without labels.
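To make the unsupervised case concrete, here is a minimal sketch of an inlier criterion built from distributional traits. It uses a median and median-absolute-deviation rule, which is a deliberately simple stand-in and not the algorithm used by ECOD or by the handler. The function names, the multiplier `k`, and the sample data are hypothetical.

```python
from statistics import median


def fit_mad_criterion(values, k=3.0):
    """Learn an inlier band (center, radius) from unlabelled training data."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    # Guard against zero spread so the band never collapses to a point.
    return center, k * max(mad, 1e-12)


def is_outlier(value, criterion):
    center, radius = criterion
    return abs(value - center) > radius


# Unlabelled training data; nothing says which points are anomalous.
train = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1, 25.0]
criterion = fit_mad_criterion(train)

# The criterion flags part of the training data itself...
train_flags = [is_outlier(v, criterion) for v in train]
# ...and new observations are classified against the same criterion.
new_flag = is_outlier(10.3, criterion)
print(train_flags, new_flag)
```

Without labels there is no way to check whether the flagged training point is a true anomaly, which is the evaluation limitation noted above.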
Default dispatch logic
We propose the following logic to determine the type of learning:
- Use supervised learning if labels are available and the dataset contains at least 3000 samples.
- Use semi-supervised learning if labels are available and the dataset contains fewer than 3000 samples.
- If the dataset is unlabelled, use unsupervised learning.
We’ve chosen 3000 based on the results of the NeurIPS AD Benchmark paper (linked above). The authors report that semi-supervised learning outperforms supervised learning when the number of samples used is less than 5% of the size of the training dataset. The average size of the training datasets in their study is 60,000, so this 5% corresponds to 3000 samples on average.
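The dispatch rules and the origin of the threshold can be sketched as below. This is an illustrative sketch under the rules stated here, not the handler's backend code; the function name, argument names, and returned strings are assumptions.

```python
# 5% of the 60,000-sample average training set size gives the threshold.
AVERAGE_TRAIN_SIZE = 60_000
SUPERVISED_MIN_SAMPLES = AVERAGE_TRAIN_SIZE * 5 // 100  # 3000


def choose_learning_type(n_samples, has_labels):
    """Pick the learning type from dataset size and label availability."""
    if not has_labels:
        return "unsupervised"
    if n_samples >= SUPERVISED_MIN_SAMPLES:
        return "supervised"
    return "semi-supervised"


print(choose_learning_type(5000, has_labels=True))
print(choose_learning_type(1200, has_labels=True))
print(choose_learning_type(5000, has_labels=False))
```

Note that the boundary case of exactly 3000 labelled samples falls under supervised learning, matching "at least 3000" in the first rule.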
Reasoning for default models on each type
We refer to the NeurIPS AD Benchmark paper (linked above) to make these choices:
- For supervised learning, use CatBoost. It often outperforms classic algorithms.
- For semi-supervised learning, XGBOD from PyOD is a good default.
- There’s no clear winner for unsupervised methods; the best choice depends on the use case. ECOD is a sensible default with a fast runtime. If we’re not concerned about runtime, we can use an ensemble.
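These defaults amount to a small lookup from learning type to model. The sketch below shows one way to express that; the dictionary keys, the lowercase model identifiers, and the function name are hypothetical and are not taken from the handler's source.

```python
# Default model per learning type, following the choices listed above.
DEFAULT_MODELS = {
    "supervised": "catboost",
    "semi-supervised": "xgbod",
    "unsupervised": "ecod",
}


def default_model(learning_type):
    """Return the default model identifier for a learning type."""
    try:
        return DEFAULT_MODELS[learning_type]
    except KeyError:
        raise ValueError(f"unknown learning type: {learning_type!r}") from None


print(default_model("unsupervised"))
```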
Prerequisites
Before proceeding, ensure the following prerequisites are met:
- Install MindsDB locally via Docker or Docker Desktop.
- To use the Anomaly Detection handler within MindsDB, install the required dependencies following these instructions.
Setup
Create an AI engine from the Anomaly Detection handler.
Create a model using anomaly_detection_engine as an engine.
Usage
To run example queries, use the data from this CSV file.
Unsupervised detection
Semi-supervised detection
Supervised detection
Specific model
Specific anomaly type
Ensemble