Additional information
- If there is no labelled data, we use an unsupervised learner with the syntax
CREATE ANOMALY DETECTION MODEL <model_name>
without specifying the target to predict. MindsDB then adds a column called outlier when generating results.
- If we have labelled data, we use the regular model creation syntax. Backend logic chooses between a semi-supervised algorithm (currently XGBOD) and a supervised algorithm (currently CatBoost).
- If multiple models are provided, then we create an ensemble and use majority voting.
- See the anomaly detection proposal document for more information.
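The majority-voting step for an ensemble can be illustrated with a minimal sketch. It assumes each model returns one 0/1 outlier flag per row and that ties resolve to inlier; the function name majority_vote and the tie rule are illustrative assumptions, not the handler's actual implementation.

```python
from typing import List, Sequence


def majority_vote(model_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model 0/1 outlier flags row by row.

    A row is flagged as an outlier (1) only when strictly more than half
    of the models flag it. Ties resolve to inlier (0); this tie rule is
    an assumption made for the sketch.
    """
    if not model_predictions:
        raise ValueError("at least one model's predictions are required")
    n_rows = len(model_predictions[0])
    if any(len(p) != n_rows for p in model_predictions):
        raise ValueError("all models must predict the same number of rows")

    n_models = len(model_predictions)
    # zip(*...) regroups the flags so each tuple holds every model's vote for one row.
    return [1 if 2 * sum(row) > n_models else 0 for row in zip(*model_predictions)]


# Three hypothetical models voting on four rows.
votes = majority_vote([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
print(votes)  # [0, 1, 0, 0]
```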
Context about types of anomaly detection
- Supervised: we have inlier/outlier labels, so we can train a classifier the normal way. This is very similar to a standard classification problem.
- Semi-supervised: we have inlier/outlier labels, and we apply an unsupervised preprocessing step followed by a supervised classification algorithm.
- Unsupervised: we don’t have inlier/outlier labels and cannot assume all training data are inliers. These methods construct inlier criteria that will classify some training data as outliers too based on distributional traits. New observations are classified against these criteria. However, it’s not possible to evaluate how well the model detects outliers without labels.
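To make the unsupervised case concrete, here is a minimal sketch that builds inlier criteria from the training data alone and then classifies new observations against them. It uses a simple interquartile-range rule on a single numeric feature, which is a stand-in chosen for brevity and not the algorithm the handler uses; the names fit_iqr_fences and is_outlier and the multiplier k=1.5 are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def fit_iqr_fences(train: Sequence[float], k: float = 1.5) -> Tuple[float, float]:
    """Derive inlier bounds from unlabelled training data.

    The bounds sit k interquartile ranges below the first quartile and
    above the third quartile, so they depend only on the distribution
    of the training values.
    """
    q1, _, q3 = statistics.quantiles(train, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x: float, fences: Tuple[float, float]) -> bool:
    """Classify a value against previously fitted bounds."""
    low, high = fences
    return x < low or x > high


train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
fences = fit_iqr_fences(train)

# Some training points can themselves fall outside the fitted bounds.
training_outliers = [x for x in train if is_outlier(x, fences)]
print(training_outliers)

# New observations are classified against the same bounds.
print(is_outlier(5, fences), is_outlier(50, fences))
```

Without labels there is nothing to compare these flags against, which is why the quality of the detection cannot be measured in this setting.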
Default dispatch logic
We propose the following logic to determine the type of learning:
- Use supervised learning if labels are available and the dataset contains at least 3000 samples.
- Use semi-supervised learning if labels are available and the dataset contains fewer than 3000 samples.
- If the dataset is unlabelled, use unsupervised learning.
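The three rules above can be sketched as a small function. The 3000-sample threshold comes from the rules; the function name choose_learning_type, its parameters, and the returned strings are hypothetical and not taken from the handler's code.

```python
# Threshold from the dispatch rules: labelled datasets at or above this size
# use supervised learning, smaller labelled datasets use semi-supervised.
SUPERVISED_MIN_SAMPLES = 3000


def choose_learning_type(has_labels: bool, n_samples: int) -> str:
    """Return the learning type implied by the default dispatch rules."""
    if not has_labels:
        return "unsupervised"
    if n_samples >= SUPERVISED_MIN_SAMPLES:
        return "supervised"
    return "semi-supervised"


print(choose_learning_type(has_labels=True, n_samples=5000))   # supervised
print(choose_learning_type(has_labels=True, n_samples=500))    # semi-supervised
print(choose_learning_type(has_labels=False, n_samples=5000))  # unsupervised
```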
Reasoning for default models on each type
We refer to the NeurIPS AD Benchmark paper (linked above) to make these choices:
- For supervised learning, use CatBoost. It often outperforms classic algorithms.
- For semi-supervised learning, XGBOD from PyOD is a good default.
- There’s no clear winner among unsupervised methods; the best choice depends on the use case. ECOD is a sensible default with a fast runtime. If runtime is not a concern, we can use an ensemble.
Prerequisites
Before proceeding, ensure the following prerequisites are met:
- Install MindsDB locally via Docker or Docker Desktop.
- To use the Anomaly Detection handler within MindsDB, install the required dependencies following these instructions.
Setup
Create an AI engine from the Anomaly Detection handler, then use anomaly_detection_engine as an engine when creating models.