TimeEval: Evaluation Tool for Anomaly Detection Algorithms on Time Series¶
User Guides¶
This part of the TimeEval documentation includes a couple of usage guides to get you started on using TimeEval for your own projects. The guides teach you TimeEval’s APIs and their usage, but they do not get into detail about how TimeEval works. You can find the detailed descriptions of TimeEval concepts here.
Using TimeEval to evaluate algorithms¶
TimeEval is an evaluation tool for time series anomaly detection algorithms. We provide a large collection of compatible datasets and algorithms. The following section describes how you can set up TimeEval to perform your own experiments using the provided datasets and algorithms. The process consists of three steps: preparing the datasets, preparing the algorithms, and writing the experiment script.
Prepare datasets¶
This section assumes that you want to use the TimeEval datasets. If you want to use your own datasets with TimeEval, please read How to use your own datasets in TimeEval.
For the evaluation of time series anomaly detection algorithms, we collected univariate and multivariate time series datasets from various sources. We looked out for real-world as well as synthetically generated datasets with real-valued values and anomaly annotations. We included datasets with direct anomaly annotations (points or subsequences are labelled as either normal (0) or anomalous (1)) and indirect anomaly annotations. For the latter, we included datasets with categorical labels, where a single class (or low number of classes) is clearly underrepresented and can be assigned to unwanted, erroneous, or anomalous behavior. One example for this is an ECG signal with beat annotations, where most beats are annotated as normal beats, but some beats are premature or superventricular heart beats. The premature or superventricular heart beats can then be labelled as anomalous while the rest of the time series is normal behavior. Overall, we collected 1354 datasets (as of May 2022). For more details about the datasets, we refer you to the Datasets page of the repeatability website of our evaluation paper (doi:10.14778/3538598.3538602).
We grouped the datasets into 24 different dataset collections for easier download and management. Each collection groups datasets from a common source, and you can download each collection separately. A dataset is thus identified by the tuple of collection name and dataset name.
TimeEval uses an index file to discover datasets. It contains the links to the time series data and summarizes metadata about them, such as the number of anomalies, contamination, input dimensionality, support for supervised or semi-supervised training of algorithms, or the time series length. The index file (named datasets.csv) for the paper's benchmark datasets can be downloaded from the repeatability page as well.
Warning
The GutenTAG dataset collection comes with its own index file!
The GutenTAG collection contains synthetically generated datasets created with the GutenTAG dataset generator, which produces TimeEval-compatible datasets along with the necessary metadata for the index file.
The downloadable ZIP-archives contain the correct folder structure, but your extraction tool might place the contained files into a new folder that is named based on the ZIP-archive-name.
The idea is that you download the index file (datasets.csv) and just the dataset collections that you require, extract them all into the same folder, and place the datasets.csv there.
Please note the name of your datasets folder. We will need it later and refer to it as <datasets-folder>.
Example:
Scenario: You want to use the datasets from the CalIt2 and Daphnet collections.
Dataset download and folder structure:
# Download CalIt2.zip, Daphnet.zip and datasets.csv
$ mkdir timeeval-datasets
$ mv datasets.csv timeeval-datasets/
$ unzip CalIt2.zip -d timeeval-datasets
$ unzip Daphnet.zip -d timeeval-datasets
$ tree timeeval-datasets
timeeval-datasets
├── datasets.csv
└── multivariate
├── CalIt2
│ ├── CalIt2-traffic.metadata.json
│ └── CalIt2-traffic.test.csv
└── Daphnet
├── S01R01E0.metadata.json
├── S01R01E0.test.csv
├── S01R01E1.metadata.json
├── S01R01E1.test.csv
├── S01R02E0.metadata.json
├── S01R02E0.test.csv
├── [...]
├── S10R01E1.metadata.json
└── S10R01E1.test.csv
3 directories, 77 files
To select datasets from multiple collections at once, the TimeEval configuration would be:
dm = MultiDatasetManager([Path("timeeval-datasets")])
datasets = []
datasets.extend(dm.select(collection="CalIt2"))
datasets.extend(dm.select(collection="Daphnet"))
#...
A list of all datasets with their respective download links is given on the Datasets page.
Prepare algorithms¶
This section assumes that you want to use the TimeEval algorithms. If you want to integrate your own algorithm into TimeEval, please read How to integrate your own algorithm into TimeEval.
We collected over 70 time series anomaly detection algorithms and integrated them into TimeEval (as of May 2022).
All of our algorithm implementations use the DockerAdapter, which allows you to use all features of TimeEval with them (such as resource restrictions, timeouts, and fair runtime measurements).
You can find the TimeEval algorithm implementations on GitHub: https://github.com/TimeEval/TimeEval-algorithms.
Using Docker images to bundle an algorithm for TimeEval also allows easy integration of new algorithms because there are no requirements regarding programming languages, frameworks, or tools.
Besides the many benefits, using Docker images to bundle algorithms makes preparing them for use with TimeEval a bit more cumbersome.
At the moment, we don't have the capacity to publish and maintain the algorithms' Docker images in a public Docker registry. This means that you have to build the Docker images from scratch before you can use the algorithms with TimeEval.
Note
If the community demand for pre-built TimeEval algorithm images rises, we will proudly assist in publishing and maintaining publicly hosted images. However, this should be a community effort.
Please follow these steps to prepare the algorithms for evaluation with TimeEval. For further details about the algorithm integration concept, please read the concept page Algorithms.
1. Clone or download the timeeval-algorithms repository.
2. Build the base Docker image for your algorithm. You can find the image dependencies in the README file of the repository. The base images are located in the folder 0-base-images. Please make sure that you tag your image correctly (the image name must match the FROM-clause in your algorithm image; this includes the image tag). To be sure, you can tag the images based on our naming scheme, which uses the prefix registry.gitlab.hpi.de/akita/i/.
3. Optionally, build an intermediate image, such as registry.gitlab.hpi.de/akita/i/tsmp, which is required for some algorithms.
4. Build the algorithm image.
5. Repeat the above steps for all algorithms that you want to execute with TimeEval.
Creating a script to build all algorithm images is left as an exercise for the reader (tip: use find to get the correct folder and image names, and iterate over them).
The README of the timeeval-algorithms repository contains example calls to test the algorithm Docker images.
Configure evaluation run¶
After we have prepared the datasets folder and the algorithm Docker images, we can install TimeEval and write an evaluation run script. You can install TimeEval from PyPI:
pip install TimeEval
We recommend creating a virtual environment with conda or virtualenv for TimeEval. The software requirements of TimeEval can be found on the home page.
When TimeEval is installed, we can use its Python API to configure an evaluation run. We recommend creating a single Python script for each evaluation run. The following snippet shows the main configuration options of TimeEval:
#!/usr/bin/env python3
from pathlib import Path

from timeeval import TimeEval, MultiDatasetManager, DefaultMetrics, Algorithm, TrainingType, InputDimensionality, ResourceConstraints
from timeeval.adapters import DockerAdapter
from timeeval.params import FixedParameters
from timeeval.resource_constraints import GB


def main():
    ####################
    # Load and select datasets
    ####################
    dm = MultiDatasetManager([
        Path("<datasets-folder>")  # e.g. ./timeeval-datasets
        # you can add multiple folders with an index file to the MultiDatasetManager
    ])
    # A DatasetManager reads the index file, allows you to access dataset metadata
    # and the datasets themselves, and provides utilities to filter datasets by their metadata.
    # - select ALL available datasets:
    # datasets = dm.select()
    # - select datasets from the Daphnet collection:
    datasets = dm.select(collection="Daphnet")
    # - select datasets with at least 2 anomalies:
    # datasets = dm.select(min_anomalies=2)
    # - select multivariate datasets with a maximum contamination of 10%:
    # datasets = dm.select(input_dimensionality=InputDimensionality.MULTIVARIATE, max_contamination=0.1)

    # limit to 5 datasets for this example
    datasets = datasets[:5]

    ####################
    # Load and configure algorithms
    ####################
    # create a list of algorithm definitions; we use a single algorithm in this example
    algorithms = [Algorithm(
        name="LOF",
        main=DockerAdapter(
            image_name="ghcr.io/timeeval/lof",
            tag="0.3.0",    # please use a specific tag instead of "latest" for reproducibility
            skip_pull=True  # set to True because the image is already present from the previous section
        ),
        # The hyperparameters of the algorithm are specified here. If you want to perform a
        # parameter search, you can also perform simple grid search with TimeEval using
        # FullParameterGrid or IndependentParameterGrid.
        param_config=FixedParameters({
            "n_neighbors": 50,
            "random_state": 42
        }),
        # required by DockerAdapter
        data_as_file=True,
        # You must specify the algorithm metadata here. The categories for all TimeEval
        # algorithms can be found in their README or their manifest.json file.
        # UNSUPERVISED --> no training, SEMI_SUPERVISED --> training on normal data,
        # SUPERVISED --> training on anomalies.
        # If SEMI_SUPERVISED or SUPERVISED, the datasets must have a corresponding training time series.
        training_type=TrainingType.UNSUPERVISED,
        # MULTIVARIATE (multidimensional TS) or UNIVARIATE (just a single dimension is supported)
        input_dimensionality=InputDimensionality.MULTIVARIATE
    )]

    ####################
    # Configure evaluation run
    ####################
    # set the number of repetitions of each algorithm-dataset combination (e.g. for runtime measurements):
    repetitions = 1
    # set resource constraints
    rcs = ResourceConstraints(
        task_memory_limit=2 * GB,
        task_cpu_limit=1.0,
    )
    # create the TimeEval object and pass all the options
    timeeval = TimeEval(dm, datasets, algorithms,
                        repetitions=repetitions,
                        resource_constraints=rcs,
                        # you can choose from different metrics:
                        metrics=[DefaultMetrics.ROC_AUC, DefaultMetrics.FIXED_RANGE_PR_AUC],
                        )

    # With run(), you start the evaluation:
    timeeval.run()
    # You can access the overall evaluation results with:
    results = timeeval.get_results()
    print(results)
    # Detailed results are automatically stored in your current working directory at ./results/<datestring>.


if __name__ == "__main__":
    main()
You can find more details about all exposed configuration options and methods in the API Reference.
If you are able to successfully execute the previous example evaluation, you can find more information at the following locations:
How to use your own datasets in TimeEval¶
You can use your own datasets with TimeEval. There are two ways to achieve this: using custom datasets or preparing your datasets as a TimeEval dataset collection. Either way, please familiarize yourself with the dataset format used by TimeEval described in the concept page Time series datasets.
1. Custom datasets¶
Important
The time series CSV-files must follow the TimeEval canonical file format!
To tell the TimeEval tool where it can find your custom datasets, a configuration file is needed.
The custom datasets config file contains all custom datasets, organized by their identifier, which is used later on.
Each entry in the config file must contain the path to the test time series;
optionally, one can add a path to the training time series, specify the dataset type, and supply the period size if known.
The paths to the data files must be absolute or relative to the configuration file.
Example file custom_datasets.json:
{
"dataset_name": {
"test_path": "/absolute/path/to/data.csv"
},
"other_supervised_dataset": {
"test_path": "/absolute/path/to/test.csv",
"train_path": "./train.csv",
"type": "synthetic",
"period": 20
}
}
You can add custom datasets to the dataset manager in two ways:
from pathlib import Path
from timeeval import DatasetManager
from timeeval.constants import HPI_CLUSTER
custom_datasets_path = Path("/absolute/path/to/custom_datasets.json")
# Directly during initialization
dm = DatasetManager(HPI_CLUSTER.akita_dataset_paths[HPI_CLUSTER.BENCHMARK], custom_datasets_file=custom_datasets_path)
# Later on
dm = DatasetManager(HPI_CLUSTER.akita_dataset_paths[HPI_CLUSTER.BENCHMARK])
dm.load_custom_datasets(custom_datasets_path)
2. Create a TimeEval dataset collection¶
Warning
WIP
How to integrate your own algorithm into TimeEval¶
If your algorithm is written in Python, you could use our FunctionAdapter (see the example of using the FunctionAdapter). However, this comes with some limitations (such as no way to limit resource usage or set timeouts). We therefore highly recommend using the DockerAdapter.
This means that you have to create a Docker image for your algorithm before you can use it in TimeEval.
In the following, we assume that you want to create a Docker image with your algorithm to execute it with TimeEval. We provide base images for various programming languages. You can find them here.
Procedure¶
This section contains a short guide on how to integrate your own algorithm into TimeEval using the DockerAdapter
.
There are three main steps:
(i) Preparing the base image, (ii) creating the algorithm image, and (iii) using the algorithm image within TimeEval.
(i) Prepare the base image¶
1. Clone the timeeval-algorithms repository.
2. Build the selected base image from 0-base-images. Please make sure that you tag your image correctly (the image name must match the FROM-clause in your algorithm image):
   - change to the 0-base-images folder: cd 0-base-images
   - build your desired base image, e.g. docker build -t ghcr.io/timeeval/python3-base:0.3.0 ./python3-base
   - (optionally: build a derived base image, e.g. docker build -t ghcr.io/timeeval/pyod:0.3.0 ./pyod)
   - now you can build your algorithm image from the base image (see next section)
Alternatively, you can also pull one of the existing base images from the registry, e.g.:
docker pull ghcr.io/timeeval/pyod:0.3.0
Note
Please contact the maintainers if there is no base image for your algorithm programming language or runtime.
(ii) Integrate your algorithm into TimeEval by creating an algorithm image¶
You can use any algorithm in the timeeval-algorithms repository as an example for this.
TimeEval uses a common interface to execute all its Docker algorithms. This interface describes data input and output as well as algorithm configuration. The calling-interface is described in TimeEval algorithm interface. Please read the section carefully and adapt your algorithm to the interface description. You could also create a wrapper script that takes care of the integration. Our canonical file format for time series datasets is described here. Once you are familiar with the concepts, you can adapt your algorithm and create its Docker image:
Create a Dockerfile for your algorithm that is based on your selected base image. Example:

FROM ghcr.io/timeeval/python3-base:0.3.0
LABEL maintainer="sebastian.schmidl@hpi.de"
ENV ALGORITHM_MAIN="/app/algorithm.py"

# install algorithm dependencies
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt

# add algorithm implementation
COPY algorithm.py /app/

Build your algorithm Docker image from the folder containing the Dockerfile, e.g.
docker build -t my_algorithm:latest .
Check if your algorithm is compatible with TimeEval:
Check if your algorithm can read a time series using our common file format.
Check if the algorithm parameters are correctly set using TimeEval's call format.
Check if the anomaly scores are written in the correct format (one anomaly score value for each point of the original time series in a headerless CSV file).
The README of the timeeval-algorithms repository provides further details and instructions on how to create and test TimeEval algorithm images, including example calls.
(iii) Use algorithm image within TimeEval¶
Create an experiment script with your configuration of datasets and your own algorithm image.
Make sure that you specify your algorithm's image and tag name correctly and use skip_pull=True. This prevents TimeEval from trying to update your algorithm image by fetching it from a Docker registry, because your image was not published to any registry. In addition, data_as_file must also be enabled for all algorithms using the DockerAdapter.
Please also specify the algorithm’s learning type (whether it requires training and which training data) and input dimensionality (uni- or multivariate):
#!/usr/bin/env python3
from pathlib import Path

from timeeval import TimeEval, MultiDatasetManager, Algorithm, TrainingType, InputDimensionality
from timeeval.adapters import DockerAdapter
from timeeval.params import FixedParameters


def main():
    dm = MultiDatasetManager([Path("<datasets-folder>")])
    datasets = dm.select()

    ####################
    # Add your own algorithm
    ####################
    algorithms = [Algorithm(
        name="<my_algorithm>",
        main=DockerAdapter(
            image_name="<my_algorithm>",
            tag="latest",
            skip_pull=True  # must be set to True because your image is only available locally
        ),
        # Set your custom parameters:
        param_config=FixedParameters({
            "random_state": 42,
            # ...
        }),
        # required by DockerAdapter
        data_as_file=True,
        # set the following metadata based on your algorithm:
        training_type=TrainingType.UNSUPERVISED,
        input_dimensionality=InputDimensionality.MULTIVARIATE
    )]

    timeeval = TimeEval(dm, datasets, algorithms)
    timeeval.run()
    results = timeeval.get_results()
    print(results)


if __name__ == "__main__":
    main()
Using custom evaluation metrics¶
Warning
This section is still work in progress.
Until we finish the documentation, please consult the API documentation of the base class Metric for a short explanation of the interface and an example.
Repetitive runs and measuring runtime¶
TimeEval can run each experiment multiple times to improve runtime measurements. For this purpose, timeeval.TimeEval has the parameter repetitions: int = 1, which tells TimeEval how many times to execute each experiment (combination of algorithm, hyperparameters, and dataset).
When measuring runtime, we highly recommend using TimeEval's feature to limit each algorithm to a specific set of resources (CPU and memory). This requires using the timeeval.adapters.docker.DockerAdapter for the algorithms. See the concept page TimeEval configuration and resource restrictions for more details.
To retrieve the aggregated results, call timeeval.TimeEval.get_results() with the parameter aggregated: bool = True. This aggregates the quality and runtime measurements using mean and standard deviation. Erroneous experiments are excluded from an aggregate. For example, if you have repetitions = 5 and one of the five experiments failed, the average is computed only over the 4 successful runs.
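The aggregation rule described above can be illustrated with a small sketch (plain Python, not TimeEval code; `aggregate_runtimes` and the use of `None` to mark failures are illustrative assumptions):

```python
import statistics
from typing import List, Optional, Tuple


def aggregate_runtimes(runtimes: List[Optional[float]]) -> Tuple[float, float]:
    """Mean and standard deviation over successful repetitions only.

    `None` marks a failed repetition and is excluded from the aggregate.
    """
    ok = [r for r in runtimes if r is not None]
    mean = statistics.mean(ok)
    std = statistics.stdev(ok) if len(ok) > 1 else 0.0
    return mean, std
```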
(Advanced) Distributed execution of TimeEval¶
Important
Before continuing with this guide, please make sure that you have read and understood this concept page.
TimeEval uses Dask’s SSHCluster to distribute tasks on a compute cluster. This means that certain prerequisites must be fulfilled before TimeEval can be executed in distributed mode.
We assume that the following requirements are already fulfilled on all hosts of the cluster (independent of whether the host has the driver, scheduler, or worker role):

Python 3 and Docker are installed.
Every node has a virtual environment (Anaconda, virtualenv, or similar) with the same name (e.g. timeeval) and prefix.
The same TimeEval version is installed in all timeeval environments.
All nodes can reach each other via network (especially via SSH).
Similar to the local execution of TimeEval, we also have to prepare the datasets and algorithms first.
Prepare time series datasets¶
Create a dataset folder on each node using the same path, for example: /data/timeeval-datasets.
Copy your datasets and the index file (datasets.csv) to all nodes.
Test if TimeEval can access this folder and find your datasets on each node:

from timeeval import DatasetManager

dmgr = DatasetManager("/data/timeeval-datasets", create_if_missing=False)
dataset = dmgr.get(("<your-collection-name>", "<your-dataset-name>"))
Prepare algorithms¶
If you use plain Python functions as algorithm implementations and the FunctionAdapter, please make sure that your Python code is either installed as a module or that the algorithm implementation is part of your single script file. Your Python script with the experiment configuration is not allowed to import any other local files (e.g., from .util import xyz). This is due to issues with the Python path on the remote machines.
If you use Docker images for your algorithms and the DockerAdapter, the algorithm images must be present on all nodes, or Docker must be able to pull them from a remote registry (this can be controlled with skip_pull=False).
There are different ways to get the Docker images to all hosts:
Build the Docker images locally on each machine (e.g., using a terminal multiplexer)
Build the Docker images on one machine and distribute them. This can be accomplished using image export and import, following this rough outline: docker build, docker image save, rsync the image archive to all machines, docker image load.
Push / publish the images to a registry available to you (if it's public, you would be responsible for maintaining it).
At the end, TimeEval must be able to create the algorithms’ Docker containers, otherwise it is not able to execute and evaluate them.
TimeEval configuration for distributed execution¶
Setting up TimeEval for distributed execution follows the same principles as for local execution.
Two arguments to the TimeEval constructor are relevant for the distributed setup: distributed: bool = False and remote_config: Optional[RemoteConfiguration] = None. You enable the distributed execution with distributed=True and configure the cluster using the RemoteConfiguration object.
The following snippet shows the available configuration options:
import sys
from timeeval import RemoteConfiguration
RemoteConfiguration(
scheduler_host = "localhost", # scheduler host
worker_hosts = [], # list of worker hosts
remote_python = sys.executable, # path to the python executable (same on all hosts)
dask_logging_file_level = "INFO", # logging level for the file-based logger
dask_logging_console_level = "INFO", # logging level for the console logger
dask_logging_filename = "dask.log", # filename for the file-based logger used for the Dask-logs
kwargs_overwrites = {} # advanced options for DaskSSHCluster
)
The driver host (executing TimeEval) must be able to open an SSH connection to all other nodes using passwordless SSH; otherwise, TimeEval will not be able to reach them.
If you use resource constraints, please make sure that you set the number of tasks per host and the CPU and memory limits correctly. We highly discourage over-provisioning. For more details, see the concept page about resource restrictions.
(Advanced) Using hyperparameter heuristics¶
Warning
WIP
Concepts¶
Time series datasets¶
TimeEval uses a canonical file format for datasets. Existing datasets in another format must first be transformed into the canonical format before they can be used with TimeEval.
Canonical file format¶
TimeEval’s canonical file format is based on CSV.
Each file requires a header, cells (values) are separated by commas (the decimal separator is .), and records are separated by newlines (Unix-style LF: \n). The first column of the dataset is its index, either in integer or datetime format (multiple timestamp formats are supported, but RFC 3339 is preferred, e.g. 2017-03-22 15:16:45.433502912). The index is followed by a single time series column or by multiple columns (for multivariate datasets). The last column contains the annotations: 0 for normal points and 1 for anomalies. Using the column headers timestamp and is_anomaly is recommended.
timestamp,value,is_anomaly
0,12751.0,1
1,8767.0,0
2,7005.0,0
3,5257.0,0
4,4189.0,0
Registering datasets¶
TimeEval comes with its own collection of benchmark datasets (currently not included; download them from our website). They can be used directly via the dataset manager DatasetManager:
from pathlib import Path
from timeeval import DatasetManager
from timeeval.constants import HPI_CLUSTER
datasets_folder: Path = HPI_CLUSTER.akita_dataset_paths[HPI_CLUSTER.BENCHMARK] # or Path("./datasets-folder")
dm = DatasetManager(datasets_folder)
datasets = dm.select()
Custom datasets¶
Important
WIP!
Example file custom_datasets.json:
{
"dataset_name": {
"test_path": "/absolute/path/to/data.csv"
},
"other_supervised_dataset": {
"test_path": "/absolute/path/to/test.csv",
"train_path": "./train.csv",
"type": "synthetic",
"period": 20
}
}
You can register custom datasets at the dataset manager in two ways:
from pathlib import Path
from timeeval import DatasetManager
from timeeval.constants import HPI_CLUSTER
custom_datasets_path = Path("/absolute/path/to/custom_datasets.json")
# Directly during initialization
dm = DatasetManager(HPI_CLUSTER.akita_dataset_paths[HPI_CLUSTER.BENCHMARK], custom_datasets_file=custom_datasets_path)
# Later on
dm = DatasetManager(HPI_CLUSTER.akita_dataset_paths[HPI_CLUSTER.BENCHMARK])
dm.load_custom_datasets(custom_datasets_path)
Preparing datasets for TimeEval¶
Datasets in other formats should be transformed into TimeEval's canonical file format.
TimeEval provides a utility script to perform this transformation: scripts/preprocess_dataset.py. You can download this script from its GitHub repository.
A single dataset can be provided in two Numpy-readable text files. The first text file contains the data. The labels must be in a separate text file. The label file can either contain the actual labels for each point in the data file or only the line indices of the anomalies. Example source data files:
Data file
12751.0
8767.0
7005.0
5257.0
4189.0
Labels file (actual labels)
1
0
0
0
0
Labels file (line indices)
3
4
The script scripts/preprocess_dataset.py automatically generates the index column using an auto-incrementing integer value. The integer value can be substituted with a corresponding timestamp (the auto-incrementing value is then interpreted as a time unit, such as seconds s or hours h, counted from the Unix epoch).
See the tool documentation for further information:
python scripts/preprocess_dataset.py --help
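The core of this transformation can be sketched as follows. This is an illustrative re-implementation, not the script itself; `to_canonical` is a hypothetical helper that supports both label-file variants described above:

```python
from typing import Sequence


def to_canonical(values: Sequence[float], labels: Sequence[int], indices_only: bool = False) -> str:
    """Combine a data column and a label file into canonical CSV text.

    `labels` is either one 0/1 label per point, or (with indices_only=True)
    the line indices of the anomalous points.
    """
    if indices_only:
        anomalous = set(labels)
        labels = [1 if i in anomalous else 0 for i in range(len(values))]
    lines = ["timestamp,value,is_anomaly"]  # auto-incrementing integer index
    for i, (v, label) in enumerate(zip(values, labels)):
        lines.append(f"{i},{v},{label}")
    return "\n".join(lines)
```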
Algorithms¶
Any algorithm that can be called with a numpy array as its parameter and returns a numpy array can be evaluated. TimeEval also supports passing only the file path to an algorithm and letting the algorithm perform the file reading and parsing itself. In this case, the algorithm must be able to read the TimeEval canonical file format. Use data_as_file=True as a keyword argument in the algorithm declaration.
The main function of an algorithm must implement the timeeval.adapters.base.Adapter interface. TimeEval comes with four different adapter types, described in the section Algorithm adapters.
Each algorithm is associated with metadata including its learning type and input dimensionality.
TimeEval distinguishes between the three learning types timeeval.TrainingType.UNSUPERVISED (default), timeeval.TrainingType.SEMI_SUPERVISED, and timeeval.TrainingType.SUPERVISED, and the two input dimensionality definitions timeeval.InputDimensionality.UNIVARIATE (default) and timeeval.InputDimensionality.MULTIVARIATE.
Registering algorithms¶
from timeeval import TimeEval, DatasetManager, Algorithm
from timeeval.adapters import FunctionAdapter
from timeeval.constants import HPI_CLUSTER
import numpy as np


def my_algorithm(data: np.ndarray) -> np.ndarray:
    return np.zeros_like(data)


datasets = [("WebscopeS5", "A1Benchmark-1")]
algorithms = [
    # Add algorithms to evaluate...
    Algorithm(
        name="MyAlgorithm",
        main=FunctionAdapter(my_algorithm),
        data_as_file=False,
    )
]

timeeval = TimeEval(DatasetManager(HPI_CLUSTER.akita_dataset_paths[HPI_CLUSTER.BENCHMARK]), datasets, algorithms)
Algorithm adapters¶
Algorithm adapters allow you to use different algorithm types within TimeEval. The most basic adapter just wraps a Python function.
You can implement your own adapters. Example:
from typing import Optional
from timeeval.adapters.base import Adapter
from timeeval.data_types import AlgorithmParameter
class MyAdapter(Adapter):
    # AlgorithmParameter = Union[np.ndarray, Path]
    def _call(self, dataset: AlgorithmParameter, args: Optional[dict] = None) -> AlgorithmParameter:
        # e.g. create another process or make a call to another language
        pass
Function adapter¶
The timeeval.adapters.function.FunctionAdapter allows you to use Python functions and methods as the algorithm's main code. You can use this adapter by wrapping your function:
from timeeval import Algorithm
from timeeval.adapters import FunctionAdapter
from timeeval.data_types import AlgorithmParameter
import numpy as np
def your_function(data: AlgorithmParameter, args: dict) -> np.ndarray:
    if isinstance(data, np.ndarray):
        return np.zeros_like(data)
    else:  # data is a pathlib.Path
        return np.genfromtxt(data)[0]


Algorithm(
    name="MyPythonFunctionAlgorithm",
    main=FunctionAdapter(your_function),
    data_as_file=False
)
Docker adapter¶
The timeeval.adapters.docker.DockerAdapter allows you to run an algorithm as a Docker container. This requires that the algorithm is available as a Docker image. This is the main adapter used for our evaluations.
Usage example:
from timeeval import Algorithm
from timeeval.adapters import DockerAdapter
Algorithm(
    name="MyDockerAlgorithm",
    main=DockerAdapter(image_name="algorithm-docker-image", tag="latest"),
    data_as_file=True  # important here!
)
Important
Using a DockerAdapter implies data_as_file=True in the Algorithm construction. The adapter supplies the dataset to the algorithm via a bind mount and does not support passing the data as a numpy array.
Experimental algorithm adapters¶
The algorithm adapters in this section are prototypical implementations and not fully tested with TimeEval. Some adapters were used in earlier versions of TimeEval and are no longer compatible with it.
Warning
The following algorithm adapters should be used for educational purposes only. They are not fully tested with TimeEval!
Distributed adapter¶
The timeeval.adapters.distributed.DistributedAdapter allows you to execute an already distributed algorithm on multiple machines. Supply a list of remote_hosts and a remote_command to this adapter. It will use SSH to connect to the remote hosts and execute the remote_command on these hosts before starting the main algorithm locally.
Important
Passwordless SSH to the remote machines is required!
Do not combine this adapter with the distributed execution of TimeEval ("TimeEval.Distributed" using TimeEval(..., distributed=True))! This will affect the timing results.
Jar adapter¶
The timeeval.adapters.jar.JarAdapter lets you evaluate Java algorithms in TimeEval. You can supply the path to the Jar file (executable) and any additional arguments to the Java process call.
Adapter to apply univariate methods to multivariate data¶
The timeeval.adapters.multivar.MultivarAdapter allows you to apply a univariate algorithm to each dimension of a multivariate dataset individually and receive a single aggregated result. You can currently choose between three different result aggregation strategies that work on single points. If n_jobs > 1, the algorithms are executed in parallel.
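Conceptually, the adapter scores each channel separately and then merges the per-point scores. The following self-contained sketch illustrates this idea; the three strategies shown (mean, median, max) are illustrative assumptions and may differ from the adapter's actual strategy names, and abs_dev is a toy stand-in for a real univariate detector:

```python
# Sketch of channel-wise application of a univariate scorer with pointwise
# aggregation, illustrating what the MultivarAdapter does conceptually.
# Strategy names (mean/median/max) are illustrative assumptions.
import statistics
from typing import Callable, List, Sequence

def apply_univariate(
    scorer: Callable[[Sequence[float]], List[float]],
    channels: Sequence[Sequence[float]],
    aggregation: str = "mean",
) -> List[float]:
    # Score each dimension of the multivariate series individually ...
    per_channel_scores = [scorer(channel) for channel in channels]
    # ... then combine the per-channel scores point by point.
    combine = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "max": max,
    }[aggregation]
    return [combine(point) for point in zip(*per_channel_scores)]

def abs_dev(channel: Sequence[float]) -> List[float]:
    # Toy univariate scorer: absolute deviation from the channel mean.
    m = statistics.fmean(channel)
    return [abs(x - m) for x in channel]

scores = apply_univariate(abs_dev, [[1.0, 1.0, 5.0], [2.0, 2.0, 2.0]], aggregation="max")
```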
Algorithms provided with TimeEval¶
All algorithms that we provide with TimeEval use the DockerAdapter
as adapter-implementation to allow you to use all features of TimeEval with them (such as resource restrictions, timeout, and fair runtime measurements).
You can find the TimeEval algorithm implementations on Github: https://github.com/TimeEval/TimeEval-algorithms and can pull the images directly from the GitHub container registry.
Using Docker images to bundle an algorithm for TimeEval also allows easy integration of new algorithms because there are no requirements regarding programming languages, frameworks, or tools.
However, using Docker images to bundle algorithms makes preparing them for use with TimeEval a bit more cumbersome (cf. How to integrate your own algorithm into TimeEval).
We use GitHub Actions to automatically build and publish the algorithm Docker images for direct use within TimeEval.
In this section, we describe some important aspects of this architecture.
TimeEval base Docker images¶
To benefit from Docker layer caching and to reduce code duplication (DRY!), we decided to put common functionality in so-called base images. The following is taken care of by base images:
Provide system (OS and common OS tools)
Provide language runtime (e.g. python3, java8)
Provide common libraries / algorithm dependencies
Define volumes for IO
Define Docker entrypoint script (performs initial container setup before the algorithm is executed)
Currently, we provide the following root base images:
| Name/Folder | Image | Usage |
|---|---|---|
| python2-base | | Base image for TimeEval methods that use python2 (version 2.7); includes default python packages. |
| python3-base | | Base image for TimeEval methods that use python3 (version 3.7.9); includes default python packages. |
| python36-base | | Base image for TimeEval methods that use python3.6 (version 3.6.13); includes default python packages. |
| r4-base | | Base image for TimeEval methods that use R (version 4.0.5). |
| java-base | | Base image for TimeEval methods that use Java (JRE 11.0.10). |
| rust-base | | Base image for TimeEval methods that use Rust (Rust 1.58). |
In addition to the root base images, we also provide some derived base images (intermediate images) that add further common functionality to the language runtimes:
| Name/Folder | Image | Usage |
|---|---|---|
| tsmp | | Base image for TimeEval methods that use the matrix profile R package |
| pyod | | Base image for TimeEval methods that are based on the |
| timeeval-test-algorithm | | Test image for TimeEval tests that use docker; is based on |
| python3-torch | | Base image for TimeEval methods that use python3 (version 3.7.9) and PyTorch (version 1.7.1); includes default python packages and torch; is based on |
You can find all current base images in the timeeval-algorithms repository under 0-base-images and 1-intermediate-images.
TimeEval algorithm interface¶
TimeEval uses a common interface to execute all the algorithms that implement the DockerAdapter. This means that the algorithms’ input, output, and parameterization are the same for all provided algorithms.
Execution and parametrization¶
All algorithms are executed by creating a Docker container using their Docker image and then executing it. The base images take care of the container startup and they call the main algorithm file with a single positional parameter. This parameter contains a String-representation of the algorithm configuration as JSON. Example parameter JSON (2022-08-18):
{
  "executionType": "train" | "execute",
  "dataInput": string,       # example: "path/to/dataset.csv"
  "dataOutput": string,      # example: "path/to/results.csv"
  "modelInput": string,      # example: "/path/to/model.pkl"
  "modelOutput": string,     # example: "/path/to/model.pkl"
  "customParameters": dict
}
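The following stdlib-only sketch shows how an algorithm's entry point might parse this configuration argument. The AlgorithmConfig dataclass and the parse_config helper are illustrative and not part of TimeEval; the field names follow the example above:

```python
# Minimal sketch of parsing the configuration JSON that an algorithm receives
# as its single positional argument. AlgorithmConfig and parse_config are
# illustrative helpers, not TimeEval API.
import json
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class AlgorithmConfig:
    execution_type: str          # "train" or "execute"
    data_input: str
    data_output: str
    model_input: str
    model_output: str
    custom_parameters: Dict[str, Any] = field(default_factory=dict)

def parse_config(argument: str) -> AlgorithmConfig:
    # The argument is a string representation of the configuration JSON.
    raw = json.loads(argument)
    return AlgorithmConfig(
        execution_type=raw["executionType"],
        data_input=raw["dataInput"],
        data_output=raw["dataOutput"],
        model_input=raw["modelInput"],
        model_output=raw["modelOutput"],
        custom_parameters=raw.get("customParameters", {}),
    )

cfg = parse_config(
    '{"executionType": "execute", "dataInput": "/data/d.csv",'
    ' "dataOutput": "/results/s.csv", "modelInput": "/results/m.pkl",'
    ' "modelOutput": "/results/m.pkl", "customParameters": {"k": 5}}'
)
```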
Custom algorithm parameters¶
All algorithm hyperparameters described in the corresponding algorithm paper are exposed via the customParameters configuration option. This allows us to set those parameters from TimeEval.
Warning
TimeEval does not parse a manifest.json file to get the custom parameters’ types and default values. We expect the users of TimeEval to be familiar with the algorithms, so that they can specify the required parameters manually. However, we require each algorithm to be executable without specifying any custom parameters (especially for testing purposes). Therefore, please provide sensible default values for all custom parameters within the method’s code. If you want to contribute your algorithm implementation to TimeEval, please add a manifest.json file to your algorithm anyway to aid the integration into other tools and for user information. If your algorithm does not fall back to its default parameters automatically and expects them to be provided, it will fail at runtime whenever the TimeEval user does not supply any parameters.
Input and output¶
Input and output for an algorithm are handled via bind-mounting files and folders into the Docker container.
All input data, such as the training dataset and the test dataset, are mounted read-only to the /data-folder of the container. The configuration options dataInput and modelInput reflect this with the correct path to the dataset (e.g. { "dataInput": "/data/dataset.test.csv" }). The dataset format follows our Canonical file format.
All output of your algorithm should be written to the /results-folder. This is also reflected in the configuration options with the correct paths for dataOutput and modelOutput (e.g. { "dataOutput": "/results/anomaly_scores.csv" }). The /results-folder is also bind-mounted to the algorithm container (but writable), so that TimeEval can access the results after your algorithm has finished. An algorithm can also use this folder to write persistent log and debug information.
Every algorithm must produce an anomaly scoring as output and place it at the location specified by the dataOutput-key in the configuration. The output file’s format is CSV-based with a single column and no header. You can, for example, produce a correct anomaly scoring with NumPy’s numpy.savetxt-function: np.savetxt(<args.dataOutput>, arr, delimiter=",").
Temporary files and data of an algorithm are written to the current working directory (currently /app) or the temporary directory /tmp within the Docker container. All files written to those folders are lost after the algorithm container is removed.
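As a stdlib-only illustration (without numpy), an algorithm could write such a single-column scoring like this; write_scores is a hypothetical helper, not part of TimeEval:

```python
# Sketch: write an anomaly scoring in the expected output format (one score
# per line, no header) using only the standard library. With numpy, the
# equivalent is np.savetxt(path, arr, delimiter=",").
import os
import tempfile
from pathlib import Path
from typing import Sequence

def write_scores(path: str, scores: Sequence[float]) -> None:
    # One floating point value per row, matching the single-column CSV format.
    Path(path).write_text("\n".join(str(s) for s in scores) + "\n")

out_path = os.path.join(tempfile.mkdtemp(), "anomaly_scores.csv")
write_scores(out_path, [0.1, 0.9, 0.3])
content = Path(out_path).read_text()
```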
Example calls¶
The following Docker command shows how the TimeEval DockerAdapter executes your algorithm image:
docker run --rm \
-v <path/to/dataset.csv>:/data/dataset.csv:ro \
-v <path/to/results-folder>:/results:rw \
-e LOCAL_UID=<current user id> \
-e LOCAL_GID=<groupid of akita group> \
<resource restrictions> \
ghcr.io/timeeval/<your_algorithm>:latest execute-algorithm '{
"executionType": "execute",
"dataInput": "/data/dataset.csv",
"modelInput": "/results/model.pkl",
"dataOutput": "/results/anomaly_scores.ts",
"modelOutput": "/results/model.pkl",
"customParameters": {}
}'
Within the container, the entry script of the base image translates this to the following call:
docker run --rm \
-v <path/to/dataset.csv>:/data/dataset.csv:ro \
-v <path/to/results-folder>:/results:rw <...> \
ghcr.io/timeeval/<your_algorithm>:latest bash
# now, within the container
<python | java -jar | Rscript> $ALGORITHM_MAIN '{
"executionType": "execute",
"dataInput": "/data/dataset.csv",
"modelInput": "/results/model.pkl",
"dataOutput": "/results/anomaly_scores.ts",
"modelOutput": "/results/model.pkl",
"customParameters": {}
}'
Parameter configuration and search¶
Warning
WIP
TimeEval configuration and resource restrictions¶
Experiments¶
Important
WIP
You can configure, to some degree, which algorithms are executed on which datasets. Per default, TimeEval evaluates all algorithms on all datasets (cross product), skipping those combinations that are not compatible. You can control which experiments are generated using the parameters skip_invalid_combinations, force_training_type_match, force_dimensionality_match, and experiment_combinations_file. Different values for those flags allow you to achieve different goals.
Avoiding conflicting combinations¶
Not all algorithms can be executed on all datasets. If the parameter skip_invalid_combinations is set to True, TimeEval will skip all invalid combinations of algorithms and datasets based on their input dimensionality and training type. It is automatically enabled if either force_training_type_match or force_dimensionality_match is set to True (see next section). Per default (force_training_type_match == force_dimensionality_match == False), the following combinations are not executed:
- supervised algorithms on semi-supervised or unsupervised datasets (the datasets cannot be used to train the algorithm)
- semi-supervised algorithms on supervised or unsupervised datasets (the datasets cannot be used to train the algorithm)
- univariate algorithms on multivariate datasets (the algorithm cannot process the dataset)
Forcing property matches¶
force_training_type_match
Narrows down the algorithm-dataset combinations further by executing an algorithm only on datasets with the same training type, e.g. unsupervised algorithms only on unsupervised datasets. This flag implies skip_invalid_combinations==True.
force_dimensionality_match
Narrows down the algorithm-dataset combinations further by executing an algorithm only on datasets with the same input dimensionality, e.g. multivariate algorithms only on multivariate datasets. This flag implies skip_invalid_combinations==True.
Selecting specific combinations¶
You can use the parameter experiment_combinations_file to supply a path to an experiment combinations CSV file. Using this file, you can specify explicitly which combinations of algorithms, datasets, and hyperparameters should be executed. The file should contain CSV data with a single header line and four columns with the following names:
- algorithm - name of the algorithm
- collection - name of the dataset collection
- dataset - name of the dataset
- hyper_params_id - ID of the hyperparameter configuration
Only experiments that are present in the TimeEval configuration and this file are scheduled and executed. This allows you to circumvent the cross-product that TimeEval will perform in its default configuration.
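For example, such a combinations file could look as follows; the algorithm, collection, and dataset names as well as the hash value are purely illustrative:

```
algorithm,collection,dataset,hyper_params_id
LOF,TestCollection,dataset-1,0f7ba4a445a17a7a
LOF,TestCollection,dataset-2,0f7ba4a445a17a7a
Subsequence IF,TestCollection,dataset-1,8c1e7ad2b34cd6f1
```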
Resource restrictions¶
The competitive evaluation of algorithms requires that all algorithms are executed in the same (or at least a very similar) execution environment. This means that no algorithm should have an unfair advantage over the other algorithms by having more time, memory, or other compute resources available.
TimeEval ensures comparable execution environments for all algorithms by executing algorithms in isolated Docker containers and controlling their resources.
When configuring TimeEval, you can specify resource limits that apply to all evaluated algorithms in the same way.
This includes the number of CPUs, main memory, training, and execution time limits.
All those resources can be specified using a timeeval.ResourceConstraints
object.
Important
To make use of resource restrictions, all evaluated algorithms must be registered using the timeeval.adapters.docker.DockerAdapter
.
It is the only adapter implementation that can deal with resource constraints.
All other adapters ignore them.
TimeEval will raise an error if you try to use resource restrictions and non-DockerAdapter
-based algorithms at the same time.
Time limits¶
Some algorithms are not suitable for very large datasets and, thus, can take a long time until they finish either training or testing.
For this reason, TimeEval uses timeouts to restrict the runtime of all (or selected) algorithms.
You can change the timeout values for the training and testing phases globally using configuration options in timeeval.ResourceConstraints:
from durations import Duration
from timeeval import TimeEval, ResourceConstraints

limits = ResourceConstraints(
    train_timeout=Duration("2 hours"),
    execute_timeout=Duration("2 hours"),
)
timeeval = TimeEval(dm, datasets, algorithms,
                    resource_constraints=limits)
...
It’s also possible to use different timeouts for specific algorithms if they run using the DockerAdapter. The DockerAdapter class can take a timeout parameter that defines the maximum amount of time the algorithm is allowed to run. This parameter also takes a durations.Duration object and overwrites the globally set timeouts. If the timeout is exceeded, a timeeval.adapters.docker.DockerTimeoutError is raised and the specific algorithm for the current dataset is cancelled.
CPU and memory limits¶
To facilitate a fair comparison of algorithms, you can configure TimeEval to restrict the execution of algorithms to specific resources. At the moment, TimeEval supports limiting the number of CPUs and the available memory. GPUs are not supported by TimeEval. The resource constraints are enforced using explicit resource limits on the Docker containers.
In the following example, we limit each algorithm to 1 CPU core and a maximum of 3 GB of memory:
from timeeval import ResourceConstraints
from timeeval.resource_constraints import GB

limits = ResourceConstraints(
    task_memory_limit=3 * GB,
    task_cpu_limit=1.0,
)
If TimeEval is executed on a distributed cluster, it assumes a homogenous cluster, where all nodes of the cluster have the same capabilities and resources. There are two options to configure resource limits for distributed TimeEval:
- automatically: Per default, TimeEval will distribute the available resources of each node evenly to all parallel evaluation tasks. By changing the number of tasks per host, you can thus easily control the resources available to the evaluation tasks without worrying about over-provisioning. Because each task, in effect, trains or executes a time series anomaly detection algorithm, the tasks are resource-intensive and over-provisioning should be prevented: it could decrease overall performance and distort runtime measurements. However, if your compute cluster is not homogenous, TimeEval will assign different resource limits to algorithms depending on the node where the algorithm is executed.
  Example: Consider a non-homogenous cluster of 2 nodes with the first node \(A\) having 10 cores and 30 GB of memory and the second node \(B\) having 20 cores and 60 GB of memory. When setting the limits to ResourceConstraints(tasks_per_host=10), algorithms executed on node \(A\) will get 1 CPU and 3 GB of memory, while algorithms executed on node \(B\) will get 2 CPUs and 6 GB of memory. Therefore, always use explicit resource constraints for non-homogeneous clusters.
- explicitly: To make sure that all algorithms have the same resource constraints and there is no over-provisioning, you should set the CPU and memory limits explicitly. For a non-homogenous cluster, take the node with the lowest overall resources and decide how many tasks you want to execute in parallel. Then divide the available resources by the number of tasks and fix the resource limits for all algorithms to these numbers.
  Example: The same non-homogenous cluster with nodes \(A\) and \(B\) and 10 tasks per node (host) would result in the following constraints:

rcs = ResourceConstraints(
    tasks_per_host=10,
    task_memory_limit=3 * GB,
    task_cpu_limit=1.0,
)
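The even division of node resources described above amounts to simple arithmetic. The following sketch illustrates it; per_task_limits is an illustrative helper, not part of the TimeEval API:

```python
# Sketch of the even resource division among parallel tasks on a node.
# per_task_limits is an illustrative helper, not TimeEval API.
from typing import Tuple

def per_task_limits(node_cpus: float, node_memory_gb: float,
                    tasks_per_host: int) -> Tuple[float, float]:
    # Each parallel evaluation task gets an equal share of the node's resources.
    return node_cpus / tasks_per_host, node_memory_gb / tasks_per_host

# Node A: 10 cores, 30 GB; node B: 20 cores, 60 GB; 10 tasks per host.
cpu_a, mem_a = per_task_limits(10, 30, 10)  # 1 CPU, 3 GB per task
cpu_b, mem_b = per_task_limits(20, 60, 10)  # 2 CPUs, 6 GB per task
```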
TimeEval results¶
When executed, TimeEval applies the algorithms with their configured hyperparameter values to all the datasets. It measures the algorithms’ runtimes and assesses their effectiveness using evaluation measures (metrics). The results are stored in a summary file called results.csv and a nested folder structure in the results-folder (./results/<timestamp> per default).
The output directory has the following structure:
results/<timestamp>/
├── results.csv
├── <algorithm_1>/<hyper_params_id>/
│   ├── <collection_name>/<dataset_name_1>/<repetition_number>/
│   │   ├── raw_anomaly_scores.ts
│   │   ├── anomaly_scores.ts
│   │   ├── docker-algorithm-scores.csv
│   │   ├── execution.log
│   │   ├── hyper_params.json
│   │   └── metrics.csv
│   └── <collection_name>/<dataset_name_2>/<repetition_number>/
│       ├── raw_anomaly_scores.ts
│       ├── anomaly_scores.ts
│       ├── docker-algorithm-scores.csv
│       ├── execution.log
│       ├── model.pkl
│       ├── hyper_params.json
│       └── metrics.csv
└── <algorithm_2>/<hyper_params_id>/
    └── <collection_name>/<dataset_name_1>/<repetition_number>/
        ├── raw_anomaly_scores.ts
        ├── anomaly_scores.ts
        ├── docker-algorithm-scores.csv
        ├── execution.log
        ├── hyper_params.json
        └── metrics.csv
We provide a description of each file below.
Summary file (results.csv)¶
For a given dataset, different algorithms with varying hyperparameters yield distinct results. The file results.csv provides an overview of the evaluation run and contains the following attributes:
| Column Name | Datatype | Description |
|---|---|---|
| algorithm | str | name of the algorithm as defined in |
| collection | str | name of the dataset collection. A collection contains similar datasets. |
| dataset | str | name of the dataset |
| algo_training_type | str | specifies whether the algorithm requires training data with anomaly labels (supervised), training data with normal data only (semi-supervised), or no training data at all (unsupervised) |
| algo_input_dimensionality | str | specifies whether the algorithm can process multiple channels (multivariate) or only a single channel (univariate) |
| dataset_training_type | str | specifies whether the dataset provides a training time series with anomaly labels (supervised), with normal data only (semi-supervised), or no training time series at all (unsupervised) |
| dataset_input_dimensionality | str | specifies if the dataset has multiple channels (multivariate) or not (univariate) |
| train_preprocess_time | float | runtime of the preprocessing step during training in seconds |
| train_main_time | float | runtime of the training in seconds (does not include pre-processing time) |
| execute_preprocess_time | float | runtime of the preprocessing step during execution in seconds |
| execute_main_time | float | runtime of the execution of the algorithm on the test time series in seconds (does not include pre- or post-processing times) |
| execute_postprocess_time | float | runtime of the post-processing step during execution in seconds |
| status | str | specifies whether the algorithm executed successfully ( |
| error_message | str | optional detailed error message |
| repetition | int | repetition number if an algorithm-dataset-hyperparameter combination was executed multiple times |
| hyper_params | str | actual hyperparameter values for this execution |
| hyper_params_id | str | alphanumerical hash of the hyperparameter configuration |
| metric_1 | float | value of the first performance metric |
| … | … | … |
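Once an evaluation run has finished, the summary file can be analyzed with standard tools. The following stdlib-only sketch loads such a file and keeps only the successful runs; the sample rows and the status values (Status.OK / Status.ERROR) are illustrative assumptions:

```python
# Sketch: load a results.csv-like summary and filter for successful runs.
# Column names follow the table above; sample data and status values are
# illustrative assumptions.
import csv
import io

SAMPLE_RESULTS = """algorithm,collection,dataset,status,execute_main_time
kmeans,TestCollection,dataset-1,Status.OK,1.5
lof,TestCollection,dataset-1,Status.ERROR,
"""

def successful_rows(fh):
    # Keep only rows whose status column marks a successful execution.
    return [row for row in csv.DictReader(fh) if row["status"] == "Status.OK"]

rows = successful_rows(io.StringIO(SAMPLE_RESULTS))
```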
Directory (<algorithm_1>/<hyper_params_id>/<collection_name>/<dataset_name_1>/<repetition_number>/)¶
For every experiment in the configured evaluation run, TimeEval creates a new directory in the result folder. It stores all the results and temporary files for this single combination of dataset, algorithm, algorithm hyperparameter values, and repetition number. The directories are structured in nested folders named by first the algorithm name, followed by the ID of the hyperparameter settings, the dataset collection name, the dataset name, and finally the repetition number. Each experiment directory contains at least the following files:
- raw_anomaly_scores.ts: The raw anomaly scores produced by the algorithm after the post-processing function was executed. The file contains no header and a single floating point value in each row for each time step of the input time series. The value range depends on the algorithm.
- anomaly_scores.ts: Normalized anomaly scores. The value range is from 0 (normal) to 1 (most anomalous).
- execution.log: Unstructured log file of the experiment execution. Contains debugging information from the Adapter, the algorithm, and the metric calculation. If an algorithm fails, its error message usually appears in this log.
- metrics.csv: This file lists the metric and runtime measurements for the corresponding experiment. The used metrics are defined by the user. Find more information in the API documentation: timeeval.metrics package
- hyper_params.json: Contains a JSON object with the hyperparameter values used to execute the algorithm on the dataset. If hyperparameter heuristics were defined, the heuristics’ values are already resolved.
All other files are optional and depend on the used algorithm Adapter. For example, the DockerAdapter usually produces a temporary file called docker-algorithm-scores.csv to pass the algorithm result from the Docker container to TimeEval, and (semi-)supervised algorithms store their trained model in model.pkl files.
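The relationship between raw_anomaly_scores.ts and anomaly_scores.ts can be thought of as a rescaling to the [0, 1] range. The following sketch assumes plain min-max normalization, which may differ from TimeEval's exact normalization procedure:

```python
# Sketch: min-max normalization of raw anomaly scores to the [0, 1] range.
# TimeEval's exact normalization may differ; this only illustrates the idea.
from typing import List, Sequence

def min_max_normalize(scores: Sequence[float]) -> List[float]:
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Constant scoring: nothing stands out, so map everything to 0 (normal).
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

normalized = min_max_normalize([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
```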
Metrics¶
Distributed TimeEval¶
TimeEval is able to run multiple experiments in parallel and distributed across a cluster of multiple machines (hosts). You can enable the distributed execution of TimeEval by setting distributed=True when creating the TimeEval object. The cluster configuration can be managed using a timeeval.RemoteConfiguration object passed to the remote_config argument.
Distributed TimeEval will use your supplied configuration of algorithms, parameters, and datasets to create a list of experiments or evaluation tasks. It then schedules the execution of these tasks to the nodes in the cluster so that the full cluster can be utilized for the evaluation run. During the run, the main process monitors the execution of the tasks. At the end of the run, it collects the evaluation results, so that you can use them for further analysis.
Host roles¶
TimeEval uses Dask's SSHCluster to dynamically create a logical cluster on a list of hosts specified by IP addresses or hostnames. Following Dask’s terminology, TimeEval distinguishes between the host roles scheduler and worker. In addition, there is also a driver role. The following table summarizes the host roles:
| Host Role | Description |
|---|---|
| driver | Host that runs the experiment script (where the TimeEval object is created). |
| scheduler | Host that runs the Dask scheduler, which coordinates the evaluation tasks. |
| worker | Host that runs one or multiple Dask workers, which execute the evaluation jobs. |
The driver host role is implicit: the host on which you create the TimeEval object gets the driver role. In distributed mode, this main Python process does not execute any evaluation jobs. The scheduler and worker roles can be assigned to hosts using the RemoteConfiguration object.
The driver host could be a local notebook or computer, while the scheduler and worker hosts are part of the cluster. A single host can have multiple roles at the same time, and for most use cases this is totally fine. Usually, a single machine of a cluster is used as driver, scheduler, and worker, while all the other machines just get the worker role. That is typically not a problem because the driver and scheduler components do not use many resources, and we can, therefore, use the computing power of the first host much more efficiently.
Example:
Assume that we have a cluster of 3 nodes (node1, node2, node3) and that we start a TimeEval experiment on node1 with the following configuration:
from timeeval import RemoteConfiguration

RemoteConfiguration(
    scheduler_host="node1",
    worker_hosts=["node1", "node2", "node3"]
)
In this case, node1 executes TimeEval (role driver), hosts the Dask scheduler (role scheduler), and participates in the execution of evaluation jobs (role worker). It has all three roles. node2 and node3, however, are pure work horses and participate in the execution of evaluation jobs only (role worker).
Distributed execution¶
If TimeEval is started with distributed=True, it automatically starts a Dask SSHCluster on the specified scheduler and worker hosts. This is done via simple SSH connections to the machines. It then uses the passed experiment configurations to create evaluation jobs (called Experiments). Each Experiment consists of an algorithm, its hyperparameters, a dataset, and a repetition number. After all Experiments have been generated and validated, they are sent to the Dask scheduler and put into a task queue. The workers pull the tasks from the scheduler and perform the evaluation (e.g., executing the Docker containers of the algorithm). All results and temporary data are stored on the disk of the local node, and the overall evaluation result is sent back to the scheduler. The driver host periodically polls the scheduler for the results and collects them in memory. When all tasks have been processed, the driver uses SSH connections again to pull all the temporary data from the worker nodes. This populates the local results-folder.
Cluster requirements¶
Because we use a Dask SSHCluster
to manage the cluster hosts, there are additional requirements for every cluster node.
Please ensure that your cluster setup meets the following requirements:
- Every node must have Python and Docker installed.
- The algorithm images must be present on all nodes or Docker must be able to pull them (if skip_pull=False).
- Every node uses the same Python environment (the path to the Python binary must be the same) and has TimeEval installed in it.
- The whole datasets’ folder must be present on all nodes at the same path. This means DatasetManager("path/to/datasets-folder", create_if_missing=False) must work on all nodes.
- Your Python script with the experiment configuration does not import any other local files (e.g., from .util import xyz).
- All hosts must be able to reach each other via network.
- The driver host must be able to open a SSH connection to all the other nodes using passwordless SSH. For this, please confirm that you can run ssh <remote_host> without any (password) prompt; otherwise, TimeEval will not be able to reach the other nodes. (https://www.google.com/search?q=passwordless+SSH)
API Reference¶
This section documents the public API of TimeEval.
timeeval package¶
An Evaluation Tool for Anomaly Detection Algorithms on Time Series Data.
How to use the documentation:
Documentation is available in two forms: docstrings provided with the code, and a standalone reference guide, available on timeeval.readthedocs.io.
Code snippets in the docstring examples are indicated by three greater-than signs:
>>> x = 42
>>> x = x + 1
Use the built-in help
function to view a function’s or class’ docstring:
>>> from timeeval import TimeEval
>>> help(TimeEval)
...
timeeval.TimeEval¶
- class timeeval.TimeEval(dataset_mgr: Datasets, datasets: List[Tuple[str, str]], algorithms: List[Algorithm], results_path: Path = PosixPath('results'), repetitions: int = 1, distributed: bool = False, remote_config: Optional[RemoteConfiguration] = None, resource_constraints: Optional[ResourceConstraints] = None, disable_progress_bar: bool = False, metrics: Optional[List[Metric]] = None, skip_invalid_combinations: bool = True, force_training_type_match: bool = False, force_dimensionality_match: bool = False, n_jobs: int = -1, experiment_combinations_file: Optional[Path] = None, module_configs: Mapping[str, Any] = {})¶
Main class of TimeEval.
This class is the main utility to configure and execute evaluation experiments. First select your algorithms and datasets, then pass them to TimeEval and use its constructor arguments to configure your evaluation run. Per default, TimeEval evaluates all algorithms on all datasets (cross product). You can use the parameters skip_invalid_combinations, force_training_type_match, force_dimensionality_match, and experiment_combinations_file to control which algorithm runs on which dataset. See the description of the other arguments for further configuration details.
After you have created your TimeEval object, holding the experiment run configuration, you can execute the experiments by calling run(). Afterward, the evaluation summary results are accessible in the results_path and from get_results().
Examples
Simple example experiment evaluating a single algorithm on the test datasets using the default metrics (just ROC_AUC):

>>> from pathlib import Path
>>> from timeeval import TimeEval, DefaultMetrics, Algorithm, TrainingType, InputDimensionality, DatasetManager
>>> from timeeval.adapters import DockerAdapter
>>> from timeeval.params import FixedParameters
>>>
>>> dm = DatasetManager(Path("tests/example_data"))
>>> datasets = dm.select()
>>>
>>> algorithms = [
>>>     Algorithm(
>>>         name="COF",
>>>         main=DockerAdapter(image_name="ghcr.io/timeeval/cof"),
>>>         param_config=FixedParameters({"n_neighbors": 20, "random_state": 42}),
>>>         data_as_file=True,
>>>         training_type=TrainingType.UNSUPERVISED,
>>>         input_dimensionality=InputDimensionality.MULTIVARIATE
>>>     ),
>>> ]
>>>
>>> timeeval = TimeEval(dm, datasets, algorithms, metrics=DefaultMetrics.default_list())
>>> timeeval.run()
>>> results = timeeval.get_results(aggregated=False)
>>> print(results)
- Parameters
dataset_mgr (
Datasets
) – The dataset manager provides the metadata about the datasets. You can either use aDatasetManager
or aMultiDatasetManager
.datasets (
List[Tuple[str
,str]]
) – List of dataset IDs consisting of collection name and dataset name to uniquely identify each dataset. The datasets must be known by thedataset_mgr
. You can callselect()
on thedataset_mgr
to get a list of dataset IDs.algorithms (
List[Algorithm]
) – List of algorithms to evaluate on the datasets. The algorithm specification also contains the hyperparameter configurations that TimeEval will test.results_path (
Path
) – Use this parameter to change the path where all evaluation results are stored. If TimeEval is used in distributed mode, this path will be created on all nodes!repetitions (
int
) – Execute each unique combination of dataset, algorithm, and hyperparameter-setting multiple times. This allows you to use TimeEval to measure runtimes more precisely by aggregating the runtime measurements over multiple repetitions.distributed (
bool
) – Run TimeEval in distributed mode. In this case, you should also supply aremote_config
.remote_config (
Optional[RemoteConfiguration]
) – Configuration of the Dask cluster used for distributed execution of TimeEval. SeeRemoteConfiguration
for details.resource_constraints (
Optional[ResourceConstraints]
) –You can supply a
ResourceConstraints
-object to limit the amount of (CPU, memory, or runtime) resources available to each experiment. These options apply to each experiment to ensure a fair comparison.Warning
Resource constraints are currently only implemented by the
DockerAdapter
. If you rely on resource constraints, please make sure that all algorithms use theDockerAdapter
-implementation.disable_progress_bar (
bool
) – Enable / disable showing the tqdm progress bars.metrics (
Optional[List[Metric]]
) – Supply a list ofMetric
to evaluate the algorithms with. TimeEval computes all supplied metrics over all experiments. If you don’t specify any metric (None
), the default metric listdefault_list()
is used instead.skip_invalid_combinations (
bool
) –Not all algorithms can be executed on all datasets. If this flag is set to
True
, TimeEval will skip all invalid combinations of algorithms and datasets based on their input dimensionality and training type. It is automatically enabled if eitherforce_training_type_match
orforce_dimensionality_match
is set toTrue
. Per default (force_training_type_match == force_dimensionality_match == False
), the following combinations are not executed:supervised algorithms on semi-supervised or unsupervised datasets (datasets cannot be used to train the algorithm)
semi-supervised algorithm on supervised or unsupervised datasets (datasets cannot be used to train the algorithm)
univariate algorithms on multivariate datasets (algorithm cannot process the dataset)
force_training_type_match (
bool
) – Narrow down the algorithm-dataset combinations further by executing an algorithm only on datasets with the same training type, e.g. unsupervised algorithms only on unsupervised datasets. This flag impliesskip_invalid_combinations==True
.force_dimensionality_match (
bool
) – Narrow down the algorithm-dataset combinations further by executing an algorithm only on datasets with the same input dimensionality, e.g. multivariate algorithms only on multivariate datasets. This flag implies skip_invalid_combinations==True
.
n_jobs (
int
) – Set the number of jobs / processes used to fetch the results from the remote machine. This setting is used only in distributed mode. -1
instructs TimeEval to use all locally available cores.
experiment_combinations_file (
Optional[Path]
) – Supply a path to an experiment combinations CSV file. Using this file, you can specify explicitly which combinations of algorithms, datasets, and hyperparameters should be executed. The file should contain CSV data with a single header line and four columns with the following names:
algorithm - name of the algorithm
collection - name of the dataset collection
dataset - name of the dataset
hyper_params_id - ID of the hyperparameter configuration
Only experiments that are present in the TimeEval configuration and this file are scheduled and executed. This allows you to circumvent the cross-product that TimeEval will perform in its default configuration.
module_configs (
Mapping[str
,Any]
, optional) – Use this parameter to pass additional configuration options for automatically loaded TimeEval modules. This is currently used only for the implementation of the Bayesian hyperparameter optimization procedure using Optuna. See
timeeval.integration.optuna.OptunaModule
and timeeval.params.bayesian.OptunaParameterSearch
for details. You can access loaded modules via the
modules
attribute (Dict[str, TimeEvalModule]
) of the TimeEval instance, e.g. timeeval.modules["optuna"]
.
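For illustration, an experiment combinations file for the experiment_combinations_file parameter could be generated with the standard library as sketched below. The row values (collection, dataset, and hyperparameter ID) are hypothetical placeholders, not real TimeEval entries:

```python
import csv
from pathlib import Path

# Hypothetical example row; the values are placeholders for illustration only.
rows = [
    {"algorithm": "arima", "collection": "univariate-anomaly-test-cases",
     "dataset": "sinus-type-mean", "hyper_params_id": "a1b2c3"},
]

# Write the CSV file with the four required columns in a single header line.
path = Path("experiment_combinations.csv")
with path.open("w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["algorithm", "collection", "dataset", "hyper_params_id"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The resulting path can then be passed as the experiment_combinations_file argument to restrict TimeEval to exactly these combinations.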
- DEFAULT_RESULT_PATH = PosixPath('results')¶
Default path for the results.
If you don’t specify the
results_path
, TimeEval will store the evaluation results in the folder results
within the current working directory.
- RESULT_KEYS = ['algorithm', 'collection', 'dataset', 'algo_training_type', 'algo_input_dimensionality', 'dataset_training_type', 'dataset_input_dimensionality', 'train_preprocess_time', 'train_main_time', 'execute_preprocess_time', 'execute_main_time', 'execute_postprocess_time', 'status', 'error_message', 'repetition', 'hyper_params', 'hyper_params_id']¶
This list contains all the fixed column headers of the result data frame. TimeEval dynamically adds the metrics and execution times depending on its configuration.
For metrics, their
name()
will be used as column header, and TimeEval will add the following runtime measurements depending on whether they are applicable to the algorithms in the run or not:
train_preprocess_time: if preprocess() is defined
train_main_time: if the algorithm is semi-supervised or supervised
execute_preprocess_time: if preprocess() is defined
execute_main_time: always
execute_postprocess_time: if postprocess() is defined
- get_results(aggregated: bool = True, short: bool = True) DataFrame ¶
Return the (aggregated) evaluation results of a previous evaluation run.
The results are returned in a Pandas
DataFrame
and contain the mean runtime and metrics of the algorithms for each dataset. You can tweak the output using the parameters.
Note
Must be called after
run()
, otherwise the returned DataFrame is empty.
- Parameters
aggregated (
bool
) – If True
, returns the aggregated results (controlled by parameter short
), otherwise all collected information is returned.
short (
bool
) – This parameter is used only in aggregation mode and controls the aggregation level and functions. If True
, the aggregation is over algorithms and datasets, and the mean of the metrics, training time, and execution time is returned. If False
, the aggregation is over algorithms, datasets, and parameter combinations, and the mean and standard deviation of all runtime measurements and metrics are computed.
- Return type
DataFrame
containing the evaluation results.
- rsync_results() None ¶
Fetches the evaluation results of the current evaluation run from all remote machines, merging the temporary data and results together on the local host. This method is automatically executed by TimeEval at the end of an evaluation run started by calling
run()
.
See also
- static rsync_results_from(results_path: Path, hosts: List[str], disable_progress_bar: bool = False, n_jobs: int = - 1) None ¶
Fetches the evaluation results of an independent TimeEval run from remote machines, merging the temporary data and results together on the local host.
- Parameters
results_path (
Path
) – Path to the evaluation results. Must be the same for all hosts.
hosts (
List[str]
) – List of hostnames or IP addresses that took part in the evaluation run.
disable_progress_bar (
bool
) – Whether to display a progress bar.
n_jobs (
int
) – Number of parallel processes used to fetch the results. The parallelism is limited by the number of external hosts and the maximum number of available CPU cores.
- run() None ¶
Starts the configured evaluation run.
Each TimeEval run consists of a number of experiments that are executed independently of each other. There are three phases: PREPARE, EVALUATION, FINALIZE.
_PREPARE_ phase: In the first phase, the execution environment is prepared, the result folder is created, and algorithm adapter-dependent preparation steps, such as pulling Docker images for the
DockerAdapter
, are executed.
_EVALUATION_ phase: In the evaluation phase, the experiments are executed and the results are recorded and stored to disk.
_FINALIZE_ phase: In the last phase, the execution environment is cleaned up, and algorithm adapter-dependent finalization steps, such as removing the temporary Docker containers for the
DockerAdapter
, are executed.
This method executes the three phases one after the other and returns after they are finished. You can access the evaluation results either using
get_results()
programmatically or via the results folder on the file system.
- save_results(results_path: Optional[Path] = None) None ¶
Store the evaluation results to a CSV file in the provided results_path. This method is automatically executed by TimeEval at the end of an evaluation run when calling
run()
.
- Parameters
results_path (
Optional[Path]
) – Path where the results should be stored. If it is not supplied, the results path of the current TimeEval run (timeeval.TimeEval.results_path
) is used.
timeeval.Status¶
timeeval.Algorithm¶
- class timeeval.Algorithm(name: str, main: Adapter, preprocess: Optional[TSFunction] = None, postprocess: Optional[TSFunctionPost] = None, data_as_file: bool = False, param_schema: Dict[str, Dict[str, Any]] = <factory>, param_config: ParameterConfig = <timeeval.params.base.FixedParameters object>, training_type: TrainingType = TrainingType.UNSUPERVISED, input_dimensionality: InputDimensionality = InputDimensionality.UNIVARIATE)¶
This class is a wrapper for any Adapter and an instruction plan for the TimeEval tool. It tells TimeEval which algorithm to execute, which pre- and post-processing steps to perform, and how the parameters and data are provided to the algorithm. Moreover, it defines attributes that TimeEval needs to determine which kinds of time series the algorithm can process.
- Parameters
name (
str
) – The name of the algorithm shown in the results.
main (
timeeval.adapters.base.Adapter
) – The adapter implementation that contains the algorithm to evaluate.
preprocess (
Optional[TSFunction]
) – Optional function to perform before main
to modify the input data.
postprocess (
Optional[TSFunctionPost]
) – Optional function to perform after main
to modify the output data.
data_as_file (
bool
) – Whether the data input is a Path
or a numpy.ndarray
.param_schema (
Dict[str
,Dict[str
,Any]]
) –Optional schema of the algorithm’s input parameters needed by
timeeval_experiments.algorithm_configurator.AlgorithmConfigurator
. Schema definition:
{
    "param_name": {
        "name": str,
        "defaultValue": Any,
        "description": str,
        "type": str
    },
    ...
}
param_config (
timeeval.params.ParameterConfig
) – Optional object of type ParameterConfig to define a search grid or fixed parameters.
training_type (
timeeval.data_types.TrainingType
) – Definition of the training type to receive the correct dataset formats (needed if TimeEval is run with the force_training_type_match
config option).
input_dimensionality (
timeeval.data_types.InputDimensionality
) – Definition of the input dimensionality to receive the correct dataset formats (needed if TimeEval is run with the force_dimensionality_match
config option).
Examples
Create a baseline algorithm that always assigns a normal anomaly score:
>>> import numpy as np
>>> from timeeval import Algorithm
>>> from timeeval.adapters import FunctionAdapter
>>> my_fn = lambda X, args: np.zeros(len(X))
>>> Algorithm(name="Test Algorithm", main=FunctionAdapter(my_fn), data_as_file=False)
- input_dimensionality: InputDimensionality = 'univariate'¶
- param_config: ParameterConfig = <timeeval.params.base.FixedParameters object>¶
- postprocess: Optional[TSFunctionPost] = None¶
- preprocess: Optional[TSFunction] = None¶
- training_type: TrainingType = 'unsupervised'¶
timeeval.InputDimensionality¶
- class timeeval.InputDimensionality(value)¶
Bases:
Enum
Input dimensionality supported by an algorithm or of a dataset.
TimeEval distinguishes between univariate and multivariate datasets / time series.
- MULTIVARIATE = 'multivariate'¶
Multivariate datasets have 2 or more features/dimensions/channels.
A multivariate algorithm can process univariate or multivariate datasets.
- UNIVARIATE = 'univariate'¶
Univariate datasets consist of a single feature/dimension/channel.
A univariate algorithm can process only a dataset with a single feature/dimension/channel.
- static from_dimensions(n: int) InputDimensionality ¶
Converts the feature/dimension/channel count to an Enum-object.
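Presumably, the conversion follows the channel counts described above. This sketch uses a stand-in enum for illustration, not the timeeval class itself:

```python
from enum import Enum

# Stand-in for timeeval.InputDimensionality, for illustration only.
class InputDimensionality(Enum):
    UNIVARIATE = "univariate"
    MULTIVARIATE = "multivariate"

    @staticmethod
    def from_dimensions(n: int) -> "InputDimensionality":
        # A single channel is univariate; two or more channels are multivariate.
        if n < 1:
            raise ValueError(f"Invalid number of dimensions: {n}")
        return (InputDimensionality.UNIVARIATE if n == 1
                else InputDimensionality.MULTIVARIATE)
```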
timeeval.TrainingType¶
- class timeeval.TrainingType(value)¶
Bases:
Enum
Training type of algorithm or dataset.
TimeEval distinguishes between unsupervised, semi-supervised, and supervised algorithms.
- SEMI_SUPERVISED = 'semi-supervised'¶
A semi-supervised algorithm requires normal data for training.
A semi-supervised dataset consists of a training time series with normal data (no anomalies; all labels are 0) and a test time series.
- SUPERVISED = 'supervised'¶
A supervised algorithm requires training data with anomalies.
A supervised dataset consists of a training time series with anomalies and a test time series.
- UNSUPERVISED = 'unsupervised'¶
An unsupervised algorithm does not require any training data.
An unsupervised dataset consists only of a single test time series.
- static from_text(name: str) TrainingType ¶
Converts the string-representation to an Enum-object.
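Since the enum values above double as the canonical string representations, the conversion can be sketched as follows (a stand-in enum for illustration, not the timeeval class itself):

```python
from enum import Enum

# Stand-in for timeeval.TrainingType, for illustration only.
class TrainingType(Enum):
    UNSUPERVISED = "unsupervised"
    SEMI_SUPERVISED = "semi-supervised"
    SUPERVISED = "supervised"

    @staticmethod
    def from_text(name: str) -> "TrainingType":
        # Look the member up by its string value, case-insensitively.
        return TrainingType(name.lower())
```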
timeeval.RemoteConfiguration¶
- class timeeval.RemoteConfiguration(scheduler_host: str = 'localhost', scheduler_port: int = 8786, worker_hosts: ~typing.List[str] = <factory>, remote_python: str = <factory>, kwargs_overwrites: ~typing.Dict[str, ~typing.Any] = <factory>, dask_logging_file_level: str = 'INFO', dask_logging_console_level: str = 'INFO', dask_logging_filename: str = 'dask.log')¶
This class holds the configuration for distributed TimeEval.
TimeEval uses a
dask.distributed.SSHCluster
to distribute the evaluation tasks to multiple compute nodes. Please read the Dask documentation carefully and then use the constructor arguments to set up a TimeEval cluster.
- Parameters
scheduler_host (
str
) – IP address or hostname for the distributed.Scheduler
. This node is responsible for coordinating the cluster. The scheduler does not perform any evaluations.
scheduler_port (
int
) – Port for the scheduler.
worker_hosts (
List[str]
) – List of IP addresses or hostnames for the distributed.Worker
. These nodes will execute the evaluation tasks.
remote_python (
str
) – Path to the Python executable. If you set up all your nodes in the same way, the default is fine.
kwargs_overwrites (
dict
) – Use this option to overwrite any configuration options of the
SSHCluster
.Warning
Only use if you know what you are doing!
dask_logging_file_level (
str
) – Logging level for the file-based Dask logger.
dask_logging_console_level (
str
) – Logging level for the console-based Dask logger.
dask_logging_filename (
str
) – Name of the Dask logging file without any parent paths. Each node will write its own logging file and TimeEval will automatically postfix the filenames with the hostname and place the Dask logging files into the results_path
.
Examples
Two-node cluster where the first node hosts the scheduler but also takes part in the evaluation:
>>> from timeeval import TimeEval, RemoteConfiguration
>>> config = RemoteConfiguration(scheduler_host="192.168.1.1", worker_hosts=["192.168.1.1", "192.168.1.2"])
>>> TimeEval(dm=..., datasets=[], algorithms=[], distributed=True, remote_config=config)
timeeval.ResourceConstraints¶
- class timeeval.ResourceConstraints(tasks_per_host: int = 1, task_memory_limit: ~typing.Optional[int] = None, task_cpu_limit: ~typing.Optional[float] = None, train_timeout: ~durations.duration.Duration = <Duration 8 hours>, execute_timeout: ~durations.duration.Duration = <Duration 8 hours>, use_preliminary_model_on_train_timeout: bool = True, use_preliminary_scores_on_execute_timeout: bool = True)¶
Use this class to configure resource constraints and how TimeEval deals with preliminary results.
Warning
Resource constraints are only supported by the
DockerAdapter
!
For Docker: swap is always disabled, and resource constraints are enforced using explicit resource limits on the Docker container.
- Parameters
tasks_per_host (
int
Specify how many evaluation tasks are executed on each host. This setting influences the default memory and CPU limits if
task_memory_limit
and task_cpu_limit
are None
: the available resources of the node are shared equally between the tasks.
Because each task, in effect, trains or executes a time series anomaly detection algorithm, the tasks are resource-intensive, which means that over-provisioning is not useful and could decrease overall performance. If runtime measurements are taken, make sure that no resources are shared between the tasks!
task_memory_limit (
Optional[int]
) – Specify the maximum allowed memory in bytes. You can use MB
and GB
for better readability. This setting limits the available main memory per task to a fixed value.
task_cpu_limit (
Optional[float]
) – Specify the maximum allowed CPU usage in fractions of CPUs (e.g. 0.25 means: only use 1/4 of a single CPU core). Usually, it is advisable to use whole CPU cores (e.g. 1.0 for 1 CPU core, 2.0 for 2 CPU cores, etc.).
train_timeout (
Duration
) – Default timeout for training an algorithm. This value can be overridden for each algorithm in its DockerAdapter
.
execute_timeout (
Duration
) – Default timeout for executing an algorithm. This value can be overridden for each algorithm in its DockerAdapter
.
use_preliminary_model_on_train_timeout (
bool
) – If this option is enabled (default), then algorithms can save preliminary models (model checkpoints) to disk and TimeEval will use the last preliminary model if the training step runs into the training timeout. This is especially useful for machine learning algorithms that use an iterative training process (e.g. using SGD). As long as the algorithm implementation stores the best-so-far model after each training epoch, the training need not be limited by the number of epochs but only by the training time.
use_preliminary_scores_on_execute_timeout (
bool
) – If this option is enabled (default) and an algorithm exceeds the execution timeout, TimeEval will look for any preliminary result. This allows the evaluation of progressive algorithms that output a rough result, refine it over time, and would otherwise run into the execution timeout.
- static default_constraints() ResourceConstraints ¶
Creates a configuration object with the default resource constraints.
- execute_timeout: Duration = <Duration 8 hours>¶
- get_compute_resource_limits(memory_overwrite: Optional[int] = None, cpu_overwrite: Optional[float] = None) Tuple[int, float] ¶
Calculates the resource constraints for a single task.
There are three sources for resource limits (in decreasing priority):
Overwrites (passed to this function as arguments)
Explicitly set resource limits (on this object using task_memory_limit and task_cpu_limit)
Default resource constraints
Overall default:
1 task per node using all available cores and RAM (except a small margin for the OS).
When multiple tasks are specified, the resources are shared equally between all concurrent tasks. This means that the CPU limit is the node’s CPU count divided by the number of tasks, and the memory limit is the node’s total memory minus 1 GB (reserved for the OS) divided by the number of tasks.
Attention
Must be called on the node that will execute the task!
- Parameters
- Returns
memory_limit, cpu_limit – Tuple of memory and CPU limit. Memory limit is expressed in Bytes and CPU limit is expressed in fractions of CPUs (e.g. 0.25 means: only use 1/4 of a single CPU core).
- Return type
Tuple[int,float]
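The default sharing rule described above can be sketched in a few lines (an illustration of the documented behavior, not TimeEval's actual code):

```python
OS_MEMORY_MARGIN = 1 * 1024 ** 3  # 1 GB reserved for the operating system

def default_task_limits(node_memory: int, node_cpus: int,
                        tasks_per_host: int) -> tuple:
    """Share a node's resources equally between concurrent tasks.

    Returns (memory_limit in bytes, cpu_limit in fractions of CPUs).
    """
    memory_limit = (node_memory - OS_MEMORY_MARGIN) // tasks_per_host
    cpu_limit = node_cpus / tasks_per_host
    return memory_limit, cpu_limit
```

For example, a node with 17 GB of RAM and 8 cores running 2 tasks per host would give each task 8 GB of memory and 4 CPU cores under this rule.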
- get_execute_timeout(timeout_overwrite: Optional[Duration] = None) Duration ¶
Returns the maximum runtime of an execution task in seconds.
- Parameters
timeout_overwrite (
Duration
) – If this is set, it will overwrite the global timeout.
- Returns
execute_timeout – The execution timeout with the highest precedence (method overwrite then global configuration).
- Return type
Duration
- get_train_timeout(timeout_overwrite: Optional[Duration] = None) Duration ¶
Returns the maximum runtime of a training task in seconds.
- Parameters
timeout_overwrite (
Duration
) – If this is set, it will overwrite the global timeout.
- Returns
train_timeout – The training timeout with the highest precedence (method overwrite then global configuration).
- Return type
Duration
- train_timeout: Duration = <Duration 8 hours>¶
- timeeval.resource_constraints.GB = 1073741824¶
\(1 GB = 2^{30} \text{Bytes}\)
Can be used to set the memory limit.
Examples
>>> from timeeval.resource_constraints import ResourceConstraints, GB
>>> ResourceConstraints(task_memory_limit=1 * GB)
- timeeval.resource_constraints.MB = 1048576¶
\(1 MB = 2^{20} \text{Bytes}\)
Can be used to set the memory limit.
Examples
>>> from timeeval.resource_constraints import ResourceConstraints, MB
>>> ResourceConstraints(task_memory_limit=500 * MB)
timeeval.constants¶
- class timeeval.constants.HPI_CLUSTER¶
Cluster constant for the HPI cluster.
These constants are applicable only for the HPI infrastructure and might not be useful for you.
- BENCHMARK = 'benchmark'¶
- CORRELATION_ANOMALIES = 'correlation-anomalies'¶
- MULTIVARIATE_ANOMALY_TEST_CASES = 'multivariate-anomaly-test-cases'¶
- MULTIVARIATE_TEST_CASES = 'multivariate-test-cases'¶
- UNIVARIATE_ANOMALY_TEST_CASES = 'univariate-anomaly-test-cases'¶
- VARIABLE_LENGTH_TEST_CASES = 'variable-length'¶
- akita_dataset_paths: Dict[str, Path] = {'benchmark': PosixPath('/home/projects/akita/data/benchmark-data/data-processed'), 'correlation-anomalies': PosixPath('/home/projects/akita/data/correlation-anomalies'), 'multivariate-anomaly-test-cases': PosixPath('/home/projects/akita/data/multivariate-anomaly-test-cases'), 'multivariate-test-cases': PosixPath('/home/projects/akita/data/multivariate-test-cases'), 'univariate-anomaly-test-cases': PosixPath('/home/projects/akita/data/univariate-anomaly-test-cases'), 'variable-length': PosixPath('/home/projects/akita/data/variable-length')}¶
This dictionary contains the paths to the dataset collection folders.
- nodes: List[str] = ['odin01', 'odin02', 'odin03', 'odin04', 'odin05', 'odin06', 'odin07', 'odin08', 'odin09', 'odin10', 'odin11', 'odin12', 'odin13', 'odin14']¶
All nodes of the homogeneous HPI cluster.
- nodes_ip: List[str] = ['172.20.11.101', '172.20.11.102', '172.20.11.103', '172.20.11.104', '172.20.11.105', '172.20.11.106', '172.20.11.107', '172.20.11.108', '172.20.11.109', '172.20.11.110', '172.20.11.111', '172.20.11.112', '172.20.11.113', '172.20.11.114']¶
All IP addresses of the nodes in the homogeneous HPI cluster.
timeeval.data_types.ExecutionType¶
- class timeeval.data_types.ExecutionType(value)¶
Bases:
Enum
Enum used to indicate the execution type of algorithms.
TimeEval calls each algorithm up to two times with two different execution types and passes the current execution type as an object of this class to the algorithm adapter implementation.
Depending on the algorithm’s
timeeval.TrainingType
, it may require a training step. TimeEval will call such algorithms first with the execution type set to TRAIN
. Then, for all algorithms, the algorithm is called with execution type EXECUTE
.
- EXECUTE = 'execute'¶
- TRAIN = 'train'¶
timeeval.adapters package¶
timeeval.adapters.base module¶
- class timeeval.adapters.base.Adapter¶
Bases:
ABC
The base class for all adapters. An adapter is a wrapper around an anomaly detection algorithm that allows the TimeEval framework to execute it in a standardized way. A subclass of Adapter must implement the _call method that executes the algorithm and returns the results. Optionally, it can also implement the get_prepare_fn and get_finalize_fn methods, which are called before and after the execution of the algorithm, respectively.
timeeval.adapters.distributed module¶
- class timeeval.adapters.distributed.DistributedAdapter(algorithm: Callable[[Union[ndarray, Path], dict], Union[ndarray, Path]], remote_command: str, remote_user: str, remote_hosts: List[str])¶
Bases:
Adapter
An adapter that runs a function as an anomaly detector on multiple remote machines. So far, this adapter only supports TSFunctions as algorithms. Please be aware that you need password-less SSH access to the remote machines!
Warning
This adapter is deprecated and will be removed in a future version of TimeEval.
timeeval.adapters.docker module¶
- class timeeval.adapters.docker.AlgorithmInterface(dataInput: pathlib.PurePath, dataOutput: pathlib.PurePath, modelInput: pathlib.PurePath, modelOutput: pathlib.PurePath, executionType: timeeval.data_types.ExecutionType, customParameters: Dict[str, Any] = <factory>)¶
Bases:
object
- executionType: ExecutionType¶
- class timeeval.adapters.docker.DockerAdapter(image_name: str, tag: str = 'latest', group_privileges: str = 'akita', skip_pull: bool = False, timeout: Optional[Duration] = None, memory_limit_overwrite: Optional[int] = None, cpu_limit_overwrite: Optional[float] = None)¶
Bases:
Adapter
An adapter that runs a Docker image as an anomaly detector. You can find a list of available Docker images on GitHub.
- Parameters
image_name (
str
) – The name of the Docker image to run.
tag (
str
) – The tag of the Docker image to run. Defaults to “latest”.
group_privileges (
str
) – The group privileges to use for the Docker container. Defaults to “akita”.
skip_pull (
bool
) – Whether to skip pulling the Docker image. Defaults to False.
timeout (
Optional[Duration]
) – The timeout for the Docker container. If not set, the timeout is taken from the ResourceConstraints
.
memory_limit_overwrite (
Optional[int]
) – The memory limit for the Docker container. If not set, the memory limit is taken from the ResourceConstraints
.
cpu_limit_overwrite (
Optional[float]
) – The CPU limit for the Docker container. If not set, the CPU limit is taken from the ResourceConstraints
.
- class timeeval.adapters.docker.DockerJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶
Bases:
NumpyEncoder
- default(o: Any) Any ¶
Implement this method in a subclass such that it returns a serializable object for
o
, or calls the base implementation (to raise a TypeError
).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
timeeval.adapters.function module¶
timeeval.adapters.jar module¶
- class timeeval.adapters.jar.JarAdapter(jar_file: str, output_file: str, args: List[Any], kwargs: Dict[str, Any], verbose: bool = False)¶
Bases:
Adapter
An adapter that runs a jar file as an anomaly detector.
Warning
This adapter is deprecated and will be removed in a future version of TimeEval.
- Parameters
jar_file (
str
) – The path to the jar file to run.
output_file (
str
) – The path to the file to which the jar file writes its output.
args (
List[Any]
) – The arguments to pass to the jar file.
kwargs (
Dict[str
, Any]
) – The keyword arguments to pass to the jar file.
verbose (
bool
) – Whether to print the output of the jar file to the console.
timeeval.adapters.multivar module¶
- class timeeval.adapters.multivar.AggregationMethod(value)¶
Bases:
Enum
An enum that specifies how to aggregate the anomaly scores of the channels.
- MAX = 2¶
aggregates channel scores using the element-wise max.
- MEAN = 0¶
aggregates channel scores using the element-wise mean.
- MEDIAN = 1¶
aggregates channel scores using the element-wise median.
- SUM_BEFORE = 3¶
sums the channels before running the anomaly detector.
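What the element-wise aggregation methods compute can be illustrated with plain Python (an illustrative sketch of the aggregation semantics, not the adapter's implementation):

```python
from statistics import mean, median

# Hypothetical anomaly scores produced per channel (rows) for a
# 4-point time series with 2 channels.
channel_scores = [
    [0.1, 0.9, 0.2, 0.3],
    [0.3, 0.5, 0.2, 0.7],
]

# Element-wise aggregation across channels, corresponding to
# MEAN, MEDIAN, and MAX: one combined score per time point.
mean_scores = [mean(col) for col in zip(*channel_scores)]
median_scores = [median(col) for col in zip(*channel_scores)]
max_scores = [max(col) for col in zip(*channel_scores)]
```

SUM_BEFORE works differently: the channels are summed into a single time series first, and the anomaly detector runs once on that combined series.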
- class timeeval.adapters.multivar.MultivarAdapter(adapter: Adapter, aggregation: AggregationMethod = AggregationMethod.MEAN)¶
Bases:
Adapter
An adapter that applies univariate anomaly detectors to the individual dimensions of a multivariate time series. In one mode, the adapter runs the anomaly detector on each dimension separately and aggregates the results using the specified aggregation method. In the other mode, the adapter combines the dimensions into a single time series and runs the anomaly detector on the combined time series.
- Parameters
adapter (
Adapter
) – The Adapter
that runs the anomaly detector on each dimension.
aggregation (
AggregationMethod
) – The AggregationMethod
to use to combine the anomaly scores of the dimensions.
timeeval.algorithms package¶
timeeval.algorithms.arima¶
- timeeval.algorithms.arima(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
ARIMA
Anomaly detector using ARIMA estimation and a (default: Euclidean) distance function to calculate the prediction error as the anomaly score
Warning
The implementation of this algorithm is not publicly available (closed source). Thus, TimeEval will fail to download the Docker image and the algorithm will not be available. Please contact the authors of the algorithm for the implementation and build the algorithm Docker image yourself.
Algorithm Parameters:
- window_size: int
Size of the sliding window (also used as the prediction window size) (default: 20)
- max_lag: int
Number of points, after which the ARIMA model is re-fitted to the data to deal with trends and shifts (default: 30000)
- p_start: int
Minimum AR-order for the auto-ARIMA process (default: 1)
- q_start: int
Minimum MA-order for the auto-ARIMA process (default: 1)
- max_p: int
Maximum AR-order for the auto-ARIMA process (default: 5)
- max_q: int
Maximum MA-order for the auto-ARIMA process (default: 5)
- differencing_degree: int
Differencing degree for the auto-ARIMA process (default: 0)
- distance_metric: enum[Euclidean,Mahalanobis,Garch,SSA,Fourier,DTW,EDRS,TWED]
Distance measure used to calculate the prediction error = anomaly score (default: Euclidean)
- random_state: int
Seed for the random number generator (default: 42)
- Parameters
params (
Optional[ParameterConfig]
) – Parameter configuration for the algorithm.
skip_pull (
bool
) – Set to True
to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (
Optional[Duration]
) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints
.
- Returns
A correctly configured
Algorithm
object for the ARIMA algorithm.
- Return type
Algorithm
timeeval.algorithms.autoencoder¶
- timeeval.algorithms.autoencoder(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
AutoEncoder (AE)
Implementation of https://dl.acm.org/doi/10.1145/2689746.2689747
Algorithm Parameters:
- latent_size: int
Dimensionality of the latent space (default: 32)
- epochs: int
Number of training epochs (default: 10)
- learning_rate: float
Learning rate (default: 0.005)
- split: float
Fraction to split the training data by for validation (default: 0.8)
- early_stopping_delta: float
Stop training early if the loss improves by less than delta for patience epochs (default: 0.5)
- early_stopping_patience: int
Stop training early if the loss improves by less than delta for patience epochs (default: 10)
- random_state: int
Seed for the random number generator (default: 42)
- Parameters
params (
Optional[ParameterConfig]
) – Parameter configuration for the algorithm.
skip_pull (
bool
) – Set to True
to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (
Optional[Duration]
) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints
.
- Returns
A correctly configured
Algorithm
object for the AutoEncoder (AE) algorithm.
- Return type
Algorithm
timeeval.algorithms.bagel¶
- timeeval.algorithms.bagel(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Bagel
Implementation of https://doi.org/10.1109/PCCC.2018.8710885
Algorithm Parameters:
- window_size: int
Size of the sliding windows (default: 120)
- latent_size: int
Dimensionality of the encoding (default: 8)
- hidden_layer_shape: List[int]
NN hidden layers structure (default: [100, 100])
- dropout: float
Rate of conditional dropout used (default: 0.1)
- cuda: boolean
Use GPU for training (default: False)
- epochs: int
Number of passes over the entire dataset (default: 50)
- batch_size: int
Batch size for the input data (default: 128)
- split: float
Fraction to split the training data by for validation (default: 0.8)
- early_stopping_delta: float
Stop training early if the loss improves by less than delta for patience epochs (default: 0.5)
- early_stopping_patience: int
Stop training early if the loss improves by less than delta for patience epochs (default: 10)
- random_state: int
Seed for the random number generator (default: 42)
- Parameters
params (
Optional[ParameterConfig]
) – Parameter configuration for the algorithm.
skip_pull (
bool
) – Set to True
to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (
Optional[Duration]
) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints
.
- Returns
A correctly configured
Algorithm
object for the Bagel algorithm.
- Return type
Algorithm
timeeval.algorithms.baseline_increasing¶
- timeeval.algorithms.baseline_increasing(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Increasing Baseline
Baseline that returns a score that steadily increases from 0 to 1
Algorithm Parameters:
- Parameters
params (
Optional[ParameterConfig]
) – Parameter configuration for the algorithm.
skip_pull (
bool
) – Set to True
to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (
Optional[Duration]
) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints
.
- Returns
A correctly configured
Algorithm
object for the Increasing Baseline algorithm.
- Return type
Algorithm
timeeval.algorithms.baseline_normal¶
- timeeval.algorithms.baseline_normal(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Normal Baseline
Baseline that returns a score of all zeros
Algorithm Parameters:
- Parameters
params (
Optional[ParameterConfig]
) – Parameter configuration for the algorithm.
skip_pull (
bool
) – Set to True
to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (
Optional[Duration]
) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints
.
- Returns
A correctly configured
Algorithm
object for the Normal Baseline algorithm.- Return type
timeeval.algorithms.baseline_random¶
- timeeval.algorithms.baseline_random(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
Random Baseline
Baseline that returns a random score between 0 and 1.
Algorithm Parameters:
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Random Baseline algorithm.
- Return type
  Algorithm
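The random_state parameter makes the otherwise random scores reproducible across runs. A sketch using Python's standard library (the function name is illustrative, not the TimeEval API):

```python
import random

def random_baseline(n: int, random_state: int = 42) -> list:
    """Return n reproducible pseudo-random anomaly scores in [0, 1)."""
    rng = random.Random(random_state)
    return [rng.random() for _ in range(n)]

a = random_baseline(10)
b = random_baseline(10)
assert a == b                            # same seed -> identical scores
assert all(0.0 <= s < 1.0 for s in a)    # scores stay in [0, 1)
```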
timeeval.algorithms.cblof¶
- timeeval.algorithms.cblof(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
CBLOF
Implementation of https://doi.org/10.1016/S0167-8655(03)00003-5.
Algorithm Parameters:
- n_clusters: int
  The number of clusters to form as well as the number of centroids to generate. (default: 8)
- alpha: float
  Coefficient for deciding small and large clusters: the ratio of the number of samples in large clusters to the number of samples in small clusters. (0.5 < alpha < 1) (default: 0.9)
- beta: float
  Coefficient for deciding small and large clusters. For a list of clusters sorted by size |C1|, |C2|, …, |Cn|, beta = |Ck|/|Ck-1|. (1.0 < beta) (default: 5)
- use_weights: boolean
  If set to True, the sizes of clusters are used as weights in the outlier score calculation. (default: false)
- random_state: int
  Seed for random number generation. (default: 42)
- n_jobs: int
  The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods. (default: 1)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the CBLOF algorithm.
- Return type
  Algorithm
timeeval.algorithms.cof¶
- timeeval.algorithms.cof(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
COF
Implementation of https://doi.org/10.1007/3-540-47887-6_53.
Algorithm Parameters:
- n_neighbors: int
  Number of neighbors to use by default for k neighbors queries. Note that n_neighbors should be less than the number of samples. If n_neighbors is larger than the number of samples provided, all samples will be used. (default: 20)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the COF algorithm.
- Return type
  Algorithm
timeeval.algorithms.copod¶
- timeeval.algorithms.copod(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
COPOD
Implementation of https://publications.pik-potsdam.de/pubman/faces/ViewItemOverviewPage.jsp?itemId=item_24536.
Algorithm Parameters:
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the COPOD algorithm.
- Return type
  Algorithm
timeeval.algorithms.dae¶
- timeeval.algorithms.dae(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DenoisingAutoEncoder (DAE)
Implementation of https://dl.acm.org/doi/10.1145/2689746.2689747
Algorithm Parameters:
- latent_size: int
  Dimensionality of latent space (default: 32)
- epochs: int
  Number of training epochs (default: 10)
- learning_rate: float
  Learning rate (default: 0.005)
- noise_ratio: float
  Percentage of points that are converted to noise (0) during training (default: 0.1)
- split: float
  Fraction to split training data by for validation (default: 0.8)
- early_stopping_delta: float
  If the loss improves by delta or less for patience epochs, stop (default: 0.5)
- early_stopping_patience: int
  If the loss improves by delta or less for patience epochs, stop (default: 10)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DenoisingAutoEncoder (DAE) algorithm.
- Return type
  Algorithm
timeeval.algorithms.damp¶
- timeeval.algorithms.damp(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DAMP
Implementation of https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf
Algorithm Parameters:
- anomaly_window_size: int
  Size of the sliding windows (default: 50)
- n_init_train: int
  Number of initial data points used to warm up the streaming computation. (default: 100)
- max_lag: int
  Maximum size to look back in time. (default: None)
- lookahead: int
  Number of steps to look into the future when deciding which future windows to skip analyzing. (default: None)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DAMP algorithm.
- Return type
  Algorithm
timeeval.algorithms.dbstream¶
- timeeval.algorithms.dbstream(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DBStream
A simple density-based clustering algorithm that assigns data points to micro-clusters with a given radius and implements shared-density-based reclustering.
Algorithm Parameters:
- window_size: int
  The length of the subsequences the dataset should be split into. (default: 20)
- radius: float
  The radius of micro-clusters. (default: 0.1)
- lambda: float
  The lambda used in the fading function. (default: 0.001)
- distance_metric: enum[Euclidean,Manhattan,Maximum]
  The metric used to calculate distances. If shared_density is TRUE, this has to be Euclidean. (default: Euclidean)
- shared_density: boolean
  Record shared density information. If set to TRUE, shared density is used for reclustering; otherwise, reachability is used (overlapping clusters with less than r∗(1−alpha) distance are clustered together). (default: True)
- n_clusters: int
  The number of macro clusters to be returned if macro is true. (default: 0)
- alpha: float
  For shared density: the minimum proportion of shared points between two clusters to warrant combining them (a suitable value for 2D data is 0.3). For reachability clustering, it is a distance factor. (default: 0.1)
- min_weight: float
  The proportion of the total weight a macro-cluster needs to have so as not to be noise (between 0 and 1). (default: 0.0)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DBStream algorithm.
- Return type
  Algorithm
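The lambda parameter controls how quickly old points lose influence on the micro-clusters. Data-stream clustering algorithms in this family commonly use an exponential fading function of the form f(t) = 2^(−λ·t); a sketch under that assumption (helper name is ours, not DBStream's API):

```python
def fading_weight(age: float, lam: float = 0.001) -> float:
    """Weight of a point observed `age` time steps ago, fading as 2^(-lam * age)."""
    return 2.0 ** (-lam * age)

# A fresh point has full weight; after 1/lambda steps the weight has halved.
assert fading_weight(0) == 1.0
assert abs(fading_weight(1000, lam=0.001) - 0.5) < 1e-12
assert fading_weight(2000) < fading_weight(1000)
```

With the default lambda of 0.001, a point therefore loses half its weight every 1000 time steps.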
timeeval.algorithms.deepant¶
- timeeval.algorithms.deepant(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DeepAnT
Adapted community implementation (https://github.com/dev-aadarsh/DeepAnT)
Algorithm Parameters:
- epochs: int
  Number of training epochs (default: 50)
- window_size: int
  History window: number of time stamps in history, which are taken into account (default: 45)
- prediction_window_size: int
  Prediction window: number of data points that will be predicted from each window (default: 1)
- learning_rate: float
  Learning rate (default: 1e-05)
- batch_size: int
  Batch size for input data (default: 45)
- random_state: int
  Seed for the random number generator (default: 42)
- split: float
  Train-validation split for early stopping (default: 0.8)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DeepAnT algorithm.
- Return type
  Algorithm
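early_stopping_delta and early_stopping_patience combine into a relative-improvement criterion: training stops once 1 − (loss / last_loss) stays below delta for patience consecutive epochs. A sketch of that rule (our helper, not part of the algorithm's container):

```python
def should_stop(losses, delta: float = 0.05, patience: int = 10) -> bool:
    """True once the relative improvement stays below delta for `patience` epochs."""
    streak = 0
    for last_loss, loss in zip(losses, losses[1:]):
        improvement = 1 - (loss / last_loss)
        streak = streak + 1 if improvement < delta else 0
        if streak >= patience:
            return True
    return False

# ~1% improvements per epoch trigger the stop with patience=2 ...
assert should_stop([1.0, 0.99, 0.98], delta=0.05, patience=2)
# ... while >5% improvements keep training alive.
assert not should_stop([1.0, 0.9, 0.8], delta=0.05, patience=2)
```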
timeeval.algorithms.deepnap¶
- timeeval.algorithms.deepnap(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DeepNAP
Implementation of https://doi.org/10.1016/j.ins.2018.05.020
Algorithm Parameters:
- anomaly_window_size: int
  Size of the sliding windows (default: 15)
- partial_sequence_length: int
  Number of points taken from the beginning of the predicted window used to build a partial sequence (with neighboring points) that is passed through another linear network. (default: 3)
- lstm_layers: int
  Number of LSTM layers within encoder and decoder (default: 2)
- rnn_hidden_size: int
  Number of neurons in LSTM hidden layer (default: 200)
- dropout: float
  Probability for a neuron to be zeroed for regularization (default: 0.5)
- linear_hidden_size: int
  Number of neurons in linear hidden layer (default: 100)
- batch_size: int
  Number of instances trained at the same time (default: 32)
- validation_batch_size: int
  Number of instances used for validation at the same time (default: 256)
- epochs: int
  Number of training iterations over the entire dataset; recommended value: 256 (default: 1)
- learning_rate: float
  Learning rate for Adam optimizer (default: 0.001)
- split: float
  Train-validation split for early stopping (default: 0.8)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DeepNAP algorithm.
- Return type
  Algorithm
timeeval.algorithms.donut¶
- timeeval.algorithms.donut(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
Donut
Implementation of https://doi.org/10.1145/3178876.3185996
Algorithm Parameters:
- window_size: int
  Size of sliding windows (default: 120)
- latent_size: int
  Dimensionality of encoding (default: 5)
- regularization: float
  Factor for regularization in loss (default: 0.001)
- linear_hidden_size: int
  Size of linear hidden layer (default: 100)
- epochs: int
  Number of training passes over entire dataset (default: 256)
- random_state: int
  Seed for random number generation. (default: 42)
- use_column_index: int
  The column index to use as input for the univariate algorithm on multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column (‘timestamp’). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Donut algorithm.
- Return type
  Algorithm
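use_column_index selects one channel of a multivariate dataset, counting from 0 and skipping the leading timestamp column. A sketch of that selection on row-oriented data (illustrative helper, not TimeEval code):

```python
def select_channel(rows, use_column_index: int = 0):
    """Pick one value column; index 0 is the first channel after 'timestamp'."""
    return [row[use_column_index + 1] for row in rows]

rows = [  # (timestamp, channel-0, channel-1)
    (0, 1.5, 10.0),
    (1, 1.7, 11.0),
]
assert select_channel(rows, 0) == [1.5, 1.7]
assert select_channel(rows, 1) == [10.0, 11.0]
```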
timeeval.algorithms.dspot¶
- timeeval.algorithms.dspot(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DSPOT
Implementation of https://doi.org/10.1145/3097983.3098144.
Algorithm Parameters:
- q: float
  Main parameter: maximum probability of an abnormal event (default: 0.001)
- n_init: int
  Calibration: number of data points used to calibrate the algorithm. The user must ensure that n_init * (1 - level) > 10 (default: 1000)
- level: float
  Calibration: proportion of initial data (n_init) not involved in the tail distribution fit during initialization. The user must ensure that n_init * (1 - level) > 10 (default: 0.99)
- up: boolean
  Compute upper thresholds (default: true)
- down: boolean
  Compute lower thresholds (default: true)
- alert: boolean
  Enable alert triggering; if false, even out-of-bounds data will be taken into account for the tail fit (default: true)
- bounded: boolean
  Performance: enable memory bounding (also improves performance) (default: true)
- max_excess: int
  Performance: maximum number of data points stored to perform the tail fit when memory bounding is enabled (default: 200)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DSPOT algorithm.
- Return type
  Algorithm
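The joint constraint on n_init and level bounds how much data feeds the tail fit: roughly n_init * (1 - level) calibration points lie above the level quantile, and the docs require that count to exceed 10. A quick sanity check (helper names are ours):

```python
def tail_points(n_init: int, level: float) -> float:
    """Approximate number of calibration points above the `level` quantile."""
    return n_init * (1 - level)

def calibration_ok(n_init: int, level: float) -> bool:
    """Check the documented requirement n_init * (1 - level) > 10."""
    return tail_points(n_init, level) > 10

assert calibration_ok(2000, 0.99)       # ~20 tail points: enough
assert not calibration_ok(500, 0.99)    # ~5 tail points: too few
```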
timeeval.algorithms.dwt_mlead¶
- timeeval.algorithms.dwt_mlead(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
DWT-MLEAD
Implementation of http://blogs.gm.fh-koeln.de/ciop/files/2019/01/thillwavelet.pdf.
Algorithm Parameters:
- start_level: int
  First discrete wavelet decomposition level to consider (default: 3)
- quantile_epsilon: float
  Percentage of windows to flag as anomalous within each decomposition level’s coefficients (default: 0.01)
- random_state: int
  Seed for the random number generator (default: 42)
- use_column_index: int
  The column index to use as input for the univariate algorithm on multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column (‘timestamp’). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the DWT-MLEAD algorithm.
- Return type
  Algorithm
timeeval.algorithms.eif¶
- timeeval.algorithms.eif(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
Extended Isolation Forest (EIF)
Extension of the basic isolation forest. Implementation of https://doi.org/10.1109/TKDE.2019.2947676. Code from https://github.com/sahandha/eif
Algorithm Parameters:
- n_trees: int
  The number of decision trees (base estimators) in the forest (ensemble). (default: 200)
- max_samples: float
  The number of samples to draw from X to train each base estimator: max_samples * X.shape[0]. If unspecified (null), then max_samples=min(256, X.shape[0]). (default: None)
- extension_level: int
  Extension level 0 resembles the standard isolation forest. If unspecified (null), then extension_level=X.shape[1] - 1. (default: None)
- limit: int
  The maximum allowed tree depth. By default, this is set to the average length of an unsuccessful search in a binary tree. (default: None)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Extended Isolation Forest (EIF) algorithm.
- Return type
  Algorithm
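The max_samples description encodes a fallback: a fraction of the dataset when given, min(256, n) when left unspecified. A sketch of how such a default could resolve (assumed helper that mirrors the parameter description, not the eif library's API):

```python
def resolve_max_samples(max_samples, n_samples: int) -> int:
    """Resolve EIF's max_samples setting to an absolute sample count."""
    if max_samples is None:                  # unspecified (null)
        return min(256, n_samples)
    return int(max_samples * n_samples)      # fraction of the dataset

assert resolve_max_samples(None, 10_000) == 256   # capped at 256
assert resolve_max_samples(None, 100) == 100      # small dataset: use all
assert resolve_max_samples(0.5, 1_000) == 500     # explicit fraction
```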
timeeval.algorithms.encdec_ad¶
- timeeval.algorithms.encdec_ad(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
EncDec-AD
Implementation of https://arxiv.org/pdf/1607.00148.pdf
Algorithm Parameters:
- lstm_layers: int
  Number of LSTM layers within encoder and decoder (default: 1)
- anomaly_window_size: int
  Size of the sliding windows (default: 30)
- latent_size: int
  Size of the autoencoder’s latent space (embedding size) (default: 40)
- batch_size: int
  Number of instances trained at the same time (default: 32)
- validation_batch_size: int
  Number of instances used for validation at the same time (default: 128)
- epochs: int
  Number of training iterations over entire dataset (default: 50)
- split: float
  Train-validation split for early stopping (default: 0.9)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- learning_rate: float
  Learning rate for Adam optimizer (default: 0.001)
- random_state: int
  Seed for the random number generator (default: 42)
- window_size: int
  Size of the sliding windows (default: 30)
- test_batch_size: int
  Number of instances used for testing at the same time (default: 128)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the EncDec-AD algorithm.
- Return type
  Algorithm
timeeval.algorithms.ensemble_gi¶
- timeeval.algorithms.ensemble_gi(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
Ensemble GI
Implementation of https://doi.org/10.5441/002/edbt.2020.09
Algorithm Parameters:
- anomaly_window_size: int
  The size of the sliding window, in which w regions are made discrete. (default: 50)
- n_estimators: int
  The number of models in the ensemble. (default: 10)
- max_paa_transform_size: int
  Maximum size of the embedding space used by PAA (SAX word size w) (default: 20)
- max_alphabet_size: int
  Maximum number of symbols used for discretization by SAX (alpha) (default: 10)
- selectivity: float
  The fraction of models in the ensemble included in the end result. (default: 0.8)
- random_state: int
  Seed for the random number generator (default: 42)
- n_jobs: int
  The number of parallel jobs to use for executing the models. If -1, then the number of jobs is set to the number of CPU cores. (default: 1)
- window_method: enum[sliding,tumbling,orig]
  Windowing method used to create subsequences. The original implementation uses an unusual method (orig) that is similar to tumbling; the paper uses a sliding window. However, sliding is significantly slower than tumbling while producing better results (higher anomaly score resolution). orig should not be used! (default: sliding)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Ensemble GI algorithm.
- Return type
  Algorithm
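The difference between the sliding and tumbling window methods is the step size: sliding advances one point at a time (many overlapping windows, hence the higher score resolution at higher cost), while tumbling jumps a full window. A sketch (helper names are ours):

```python
def sliding(series, w: int):
    """Overlapping windows, step 1: one window per admissible start position."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def tumbling(series, w: int):
    """Non-overlapping windows, step w: each point appears in at most one window."""
    return [series[i:i + w] for i in range(0, len(series) - w + 1, w)]

data = list(range(6))
assert sliding(data, 3) == [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
assert tumbling(data, 3) == [[0, 1, 2], [3, 4, 5]]
```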
timeeval.algorithms.fast_mcd¶
- timeeval.algorithms.fast_mcd(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
Fast-MCD
Implementation of https://doi.org/10.2307/1270566
Algorithm Parameters:
- store_precision: boolean
  Specify if the estimated precision is stored (default: True)
- support_fraction: float
  The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: (n_sample + n_features + 1) / 2. The parameter must be in the range (0, 1). (default: None)
- random_state: int
  Determines the pseudo-random number generator for shuffling the data. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Fast-MCD algorithm.
- Return type
  Algorithm
timeeval.algorithms.fft¶
- timeeval.algorithms.fft(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
FFT
Implementation of https://dl.acm.org/doi/10.5555/1789574.1789615 proudly provided by members of the HPI AKITA project.
Algorithm Parameters:
- fft_parameters: int
  Number of parameters to be used in IFFT for creating the fit. (default: 5)
- context_window_size: int
  Centered window of neighbors to consider for the calculation of local outliers’ z_scores (default: 21)
- local_outlier_threshold: float
  Outlier threshold in multiples of sigma for local outliers (default: 0.6)
- max_anomaly_window_size: int
  Maximum size of outlier regions. (default: 50)
- max_sign_change_distance: int
  Maximum gap between two close, oppositely signed local outliers to detect a sign change for outlier region grouping. (default: 10)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the FFT algorithm.
- Return type
  Algorithm
timeeval.algorithms.generic_rf¶
- timeeval.algorithms.generic_rf(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
Random Forest Regressor (RR)
A generic windowed forecasting method using random forest regression (requested by RollsRoyce). The forecasting error is used as anomaly score.
Algorithm Parameters:
- train_window_size: int
  Size of the training windows. Always predicts a single point! (default: 50)
- n_trees: int
  The number of trees in the forest. (default: 100)
- max_features_method: enum[auto,sqrt,log2]
  The number of features to consider when looking for the best split between trees: ‘auto’: max_features=n_features, ‘sqrt’: max_features=sqrt(n_features), ‘log2’: max_features=log2(n_features). (default: auto)
- bootstrap: boolean
  Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. (default: True)
- max_samples: float
  If bootstrap is True, the number of samples to draw from X to train each base estimator. (default: None)
- random_state: int
  Seeds the randomness of the bootstrapping and the sampling of the features. (default: 42)
- verbose: int
  Controls logging verbosity. (default: 0)
- n_jobs: int
  The number of jobs to run in parallel. -1 means using all processors. (default: 1)
- max_depth: int
  The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. (default: None)
- min_samples_split: int
  The minimum number of samples required to split an internal node. (default: 2)
- min_samples_leaf: int
  The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. (default: 1)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Random Forest Regressor (RR) algorithm.
- Return type
  Algorithm
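max_features_method maps symbolic names onto feature counts exactly as listed above. A sketch of that mapping (the helper name is ours, not the container's API):

```python
import math

def resolve_max_features(method: str, n_features: int) -> int:
    """Translate a max_features_method name into a per-split feature count."""
    if method == "auto":
        return n_features                         # consider every feature
    if method == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if method == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unknown method: {method}")

assert resolve_max_features("auto", 20) == 20
assert resolve_max_features("sqrt", 16) == 4
assert resolve_max_features("log2", 16) == 4
```

With 'auto' the trees are only decorrelated by bootstrapping; 'sqrt' and 'log2' additionally randomize the features considered at each split.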
timeeval.algorithms.generic_xgb¶
- timeeval.algorithms.generic_xgb(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm¶
XGBoosting (RR)
A generic windowed forecasting method using XGBoost regression (requested by RollsRoyce). The forecasting error is used as anomaly score.
Algorithm Parameters:
- train_window_size: int
  Size of the training windows. Always predicts a single point! (default: 50)
- n_estimators: int
  Number of gradient boosted trees. Equivalent to number of boosting rounds. (default: 100)
- learning_rate: float
  Boosting learning rate (xgb’s eta) (default: 0.1)
- booster: enum[gbtree,gblinear,dart]
  Booster to use (default: gbtree)
- tree_method: enum[auto,exact,approx,hist]
  Tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. exact is slowest, hist is fastest. Prefer hist and approx over exact, because for most datasets they have comparable quality but are significantly faster. (default: auto)
- n_trees: int
  If >1, then boosting random forests with n_trees trees. (default: 1)
- max_depth: int
  Maximum tree depth for base learners. (default: None)
- max_samples: float
  Subsample ratio of the training instance. (default: None)
- colsample_bytree: float
  Subsample ratio of columns when constructing each tree. (default: None)
- colsample_bylevel: float
  Subsample ratio of columns for each level. (default: None)
- colsample_bynode: float
  Subsample ratio of columns for each split. (default: None)
- random_state: int
  Seeds the randomness of the bootstrapping and the sampling of the features. (default: 42)
- verbose: int
  Controls logging verbosity. (default: 0)
- n_jobs: int
  The number of jobs to run in parallel. -1 means using all processors. (default: 1)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the XGBoosting (RR) algorithm.
- Return type
  Algorithm
timeeval.algorithms.grammarviz3¶
- timeeval.algorithms.grammarviz3(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
GrammarViz
Implementation of https://doi.org/10.1145/3051126.
Algorithm Parameters:
- anomaly_window_size: int
  Size of the sliding window. Equal to the discord length! (default: 170)
- paa_transform_size: int
  Size of the embedding space used by PAA (paper calls it number of frames or SAX word size w) (performance parameter) (default: 4)
- alphabet_size: int
  Number of symbols used for discretization by SAX (paper uses α) (performance parameter) (default: 4)
- normalization_threshold: float
  Threshold for Z-normalization of subsequences (windows). If the variance of a window is higher than this threshold, it is normalized. (default: 0.01)
- random_state: int
  Seed for the random number generator (default: 42)
- use_column_index: int
  The column index to use as input for the univariate algorithm for multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column ('timestamp'). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the GrammarViz algorithm.
- Return type
  Algorithm
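The `paa_transform_size` and `alphabet_size` parameters control the two discretization steps that GrammarViz inherits from SAX. The following is a minimal, hypothetical sketch of these two transformations for intuition only (the actual implementation runs inside the algorithm's Docker image and differs in detail):

```python
import statistics

def paa(window, n_frames):
    """Piecewise Aggregate Approximation: mean of each of n_frames equal segments."""
    n = len(window)
    return [statistics.fmean(window[i * n // n_frames:(i + 1) * n // n_frames])
            for i in range(n_frames)]

def sax(paa_values, breakpoints):
    """Map each PAA value to a symbol index using sorted breakpoints."""
    return [sum(v > b for b in breakpoints) for v in paa_values]

window = [0.0, 0.1, 0.2, 2.0, 2.1, 1.9, -2.0, -2.1, -1.9]
frames = paa(window, 3)           # corresponds to paa_transform_size = 3
# Breakpoints for alphabet_size = 3 (approximate standard-normal terciles).
word = sax(frames, [-0.43, 0.43])
```

A larger `paa_transform_size` keeps more shape detail per window, while a larger `alphabet_size` distinguishes more value levels; both trade accuracy against grammar size.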
timeeval.algorithms.grammarviz3_multi¶
- timeeval.algorithms.grammarviz3_multi(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
GrammarViz-Multivariate
Multivariate extension of the GrammarViz3 algorithm.
Algorithm Parameters:
- anomaly_window_size: int
  Size of the sliding window. Equal to the discord length! (default: 100)
- output_mode: int
  Algorithm to use for output [Density, Discord, Full] (default: 2)
- multi_strategy: int
  Strategy to handle multivariate output [Merge all, Merge clustered, All separate] (default: 1)
- paa_transform_size: int
  Size of the embedding space used by PAA (paper calls it number of frames or SAX word size w) (performance parameter) (default: 5)
- alphabet_size: int
  Number of symbols used for discretization by SAX (paper uses α) (performance parameter) (default: 6)
- normalization_threshold: float
  Threshold for Z-normalization of subsequences (windows). If the variance of a window is higher than this threshold, it is normalized. (default: 0.01)
- random_state: int
  Seed for the random number generator (default: 42)
- n_discords: int
  Number of discords to report when using the discord output strategy (default: 10)
- numerosity_reduction: boolean
  Disables / enables the numerosity reduction strategy (default: True)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the GrammarViz-Multivariate algorithm.
- Return type
  Algorithm
timeeval.algorithms.hbos¶
- timeeval.algorithms.hbos(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
HBOS
Implementation of https://citeseerx.ist.psu.edu/viewdoc/citations;jsessionid=2B4E3FB2BB07448253B4D45C3DAC2E95?doi=10.1.1.401.5686.
Algorithm Parameters:
- n_bins: int
  The number of bins. (default: 10)
- alpha: float
  Regularizing alpha to prevent overflows. (default: 0.1)
- bin_tol: float
  Parameter to decide the flexibility while dealing with samples falling outside the bins. (default: 0.5)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the HBOS algorithm.
- Return type
  Algorithm
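To illustrate what `n_bins` controls, here is a minimal, self-contained sketch of the histogram-based scoring idea behind HBOS for a single feature (a simplification for intuition, not the containerized implementation, which builds one histogram per feature and combines them):

```python
import math

def hbos_scores(values, n_bins=10):
    """Score each point by the negative log of its histogram bin's relative height:
    points in sparsely populated bins get high anomaly scores."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    heights = [c / len(values) for c in counts]
    return [-math.log(heights[min(int((v - lo) / width), n_bins - 1)] + 1e-12)
            for v in values]

data = [1.0] * 50 + [10.0]          # one obvious outlier
scores = hbos_scores(data, n_bins=10)
```

Fewer bins smooth the density estimate; more bins make the score more sensitive but noisier.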
timeeval.algorithms.health_esn¶
- timeeval.algorithms.health_esn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
HealthESN
Implementation of https://doi.org/10.1007/s00521-018-3747-z
Algorithm Parameters:
- linear_hidden_size: int
  Hidden units in the ESN reservoir. (default: 500)
- prediction_window_size: int
  Window of predicted points in the future. (default: 20)
- connectivity: float
  How densely the units in the reservoir are connected (= percentage of non-zero weights) (default: 0.25)
- spectral_radius: float
  Factor used for random initialization of ESN neural connections. (default: 0.6)
- activation: enum[tanh,sigmoid]
  Activation function used for the ESN. (default: tanh)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the HealthESN algorithm.
- Return type
  Algorithm
timeeval.algorithms.hif¶
- timeeval.algorithms.hif(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Hybrid Isolation Forest (HIF)
Implementation of https://arxiv.org/abs/1705.03800
Algorithm Parameters:
- n_trees: int
  The number of decision trees (base estimators) in the forest (ensemble). (default: 1024)
- max_samples: float
  The number of samples to draw from X to train each base estimator: max_samples * X.shape[0]. If unspecified (null), then max_samples=min(256, X.shape[0]). (default: None)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Hybrid Isolation Forest (HIF) algorithm.
- Return type
  Algorithm
timeeval.algorithms.hotsax¶
- timeeval.algorithms.hotsax(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
HOT SAX
Implementation of https://doi.org/10.1109/ICDM.2005.79.
Algorithm Parameters:
- num_discords: int
  The number of anomalies (discords) to search for in the time series. If not set, the scores for all discords are computed. (default: None)
- anomaly_window_size: int
  Size of the sliding window. Equal to the discord length! (default: 100)
- paa_transform_size: int
  Size of the embedding space used by PAA (paper calls it number of frames or SAX word size w) (performance parameter) (default: 3)
- alphabet_size: int
  Number of symbols used for discretization by SAX (paper uses α) (performance parameter) (default: 3)
- normalization_threshold: float
  Threshold for Z-normalization of subsequences (windows). If the variance of a window is higher than this threshold, it is normalized. (default: 0.01)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the HOT SAX algorithm.
- Return type
  Algorithm
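The `normalization_threshold` parameter, shared by the SAX-based algorithms above, guards the Z-normalization of each window: nearly constant windows are left unchanged so that numerical noise is not blown up. A minimal sketch of this behavior (an assumption for illustration; the containerized implementations may differ in detail):

```python
import statistics

def z_normalize(window, normalization_threshold=0.01):
    """Z-normalize a window only if its variance exceeds the threshold;
    nearly constant windows are returned unchanged."""
    var = statistics.pvariance(window)
    if var <= normalization_threshold:
        return list(window)
    mean = statistics.fmean(window)
    std = var ** 0.5
    return [(v - mean) / std for v in window]

flat = z_normalize([5.0, 5.0, 5.001])    # variance below threshold: untouched
varied = z_normalize([0.0, 2.0, 4.0])    # variance above threshold: normalized
```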
timeeval.algorithms.hybrid_knn¶
- timeeval.algorithms.hybrid_knn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Hybrid KNN
Implementation of https://www.hindawi.com/journals/cin/2017/8501683/
Algorithm Parameters:
- linear_layer_shape: List[int]
  NN structure with the embedding dimension as last value (default: [100, 10])
- split: float
  Train-validation split (default: 0.8)
- anomaly_window_size: int
  Windowing size for the time series (default: 20)
- batch_size: int
  Number of simultaneously trained data instances (default: 64)
- test_batch_size: int
  Number of simultaneously tested data instances (default: 256)
- epochs: int
  Number of training iterations over the entire dataset (default: 1)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- learning_rate: float
  Gradient factor for backpropagation (default: 0.001)
- n_neighbors: int
  Defines which neighbour's distance to use (default: 12)
- n_estimators: int
  Defines the number of ensembles (default: 3)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Hybrid KNN algorithm.
- Return type
  Algorithm
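The `early_stopping_delta` / `early_stopping_patience` pair recurs in most of the deep-learning algorithms on this page. A minimal sketch of the stopping rule exactly as the parameter descriptions state it (assuming the relative-improvement check is evaluated once per epoch):

```python
def should_stop(losses, delta=0.05, patience=10):
    """Stop when the relative improvement 1 - (loss / last_loss) stayed
    below delta for `patience` consecutive epochs."""
    stalled = 0
    for last_loss, loss in zip(losses, losses[1:]):
        improvement = 1.0 - loss / last_loss
        stalled = stalled + 1 if improvement < delta else 0
        if stalled >= patience:
            return True
    return False

plateau = [1.0] * 12                              # no improvement at all
improving = [0.5 ** i for i in range(12)]         # loss halves every epoch
```

With the defaults, training stops on the plateau after ten stalled epochs, while steadily improving losses never trigger the rule.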
timeeval.algorithms.if_lof¶
- timeeval.algorithms.if_lof(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
IF-LOF
Isolation Forest - Local Outlier Factor: uses a three-step process - building an isolation forest, pruning the forest with a computed threshold, and applying the local outlier factor to the resulting dataset.
Algorithm Parameters:
- n_trees: int
  Number of trees in the isolation forest (default: 200)
- max_samples: float
  The number of samples to draw from X to train each tree: max_samples * X.shape[0]. If unspecified (null), then max_samples=min(256, X.shape[0]). (default: None)
- n_neighbors: int
  Number of neighbors to look at in the local outlier factor calculation (default: 10)
- alpha: float
  Scalar that depends on consideration of the dataset and controls the amount of data to be pruned (default: 0.5)
- m: int
  The m features with the highest scores will be used for pruning (default: None)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the IF-LOF algorithm.
- Return type
  Algorithm
timeeval.algorithms.iforest¶
- timeeval.algorithms.iforest(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Isolation Forest (iForest)
Implementation of https://doi.org/10.1145/2133360.2133363.
Algorithm Parameters:
- n_trees: int
  The number of decision trees (base estimators) in the forest (ensemble). (default: 100)
- max_samples: float
  The number of samples to draw from X to train each base estimator: max_samples * X.shape[0]. If unspecified (null), then max_samples=min(256, n_samples). (default: None)
- max_features: float
  The number of features to draw from X to train each base estimator: max_features * X.shape[1]. (default: 1.0)
- bootstrap: boolean
  If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed. (default: false)
- random_state: int
  Seed for random number generation. (default: 42)
- verbose: int
  Controls the verbosity of the tree building process logs. (default: 0)
- n_jobs: int
  The number of jobs to run in parallel. If -1, then the number of jobs is set to the number of cores. (default: 1)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Isolation Forest (iForest) algorithm.
- Return type
  Algorithm
timeeval.algorithms.img_embedding_cae¶
- timeeval.algorithms.img_embedding_cae(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
ImageEmbeddingCAE
Implementation of http://arxiv.org/abs/2009.02040
Algorithm Parameters:
- anomaly_window_size: int
  Length of one time series chunk (tumbling window) (default: 512)
- kernel_size: int
  Width and height of each convolution kernel (stride is equal to this value) (default: 2)
- num_kernels: int
  Number of convolution kernels used in each layer (default: 64)
- latent_size: int
  Number of neurons used in the embedding layer (default: 100)
- leaky_relu_alpha: float
  Alpha value used for the leaky ReLU activation function (default: 0.03)
- batch_size: int
  Number of simultaneously trained data instances (default: 32)
- test_batch_size: int
  Number of simultaneously tested data instances (default: 128)
- learning_rate: float
  Gradient factor for backpropagation (default: 0.001)
- epochs: int
  Number of training iterations over the entire dataset (default: 30)
- split: float
  Train-validation split (default: 0.8)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the ImageEmbeddingCAE algorithm.
- Return type
  Algorithm
timeeval.algorithms.kmeans¶
- timeeval.algorithms.kmeans(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
k-Means
Implementation of http://robotics.estec.esa.int/i-SAIRAS/isairas2001/papers/Paper_AS012.pdf
Algorithm Parameters:
- n_clusters: int
  The number of clusters to form as well as the number of centroids to generate. The bigger n_clusters (k) is, the less noisy the anomaly scores are. (default: 20)
- anomaly_window_size: int
  Size of the sliding windows. The bigger window_size is, the bigger the anomaly context is. If it is too big, things seem anomalous that are not. If it is too small, the algorithm is not able to find anomalous windows and loses its time context. (default: 20)
- stride: int
  Stride of the sliding windows. It is the step size between windows. The larger stride is, the noisier the scores get. If stride == window_size, the windows are tumbling windows. (default: 1)
- n_jobs: int
  Internal parallelism used (sample-wise in the main loop, which assigns each sample to its closest center). If -1 or None, all available CPUs are used. (default: 1)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the k-Means algorithm.
- Return type
  Algorithm
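The interplay of `anomaly_window_size` and `stride` described above can be sketched with a few lines of plain Python (for intuition only; window extraction inside the algorithm image may differ):

```python
def sliding_windows(series, window_size, stride=1):
    """Cut a series into subsequences of length window_size,
    advancing by `stride` points each time."""
    return [series[i:i + window_size]
            for i in range(0, len(series) - window_size + 1, stride)]

series = list(range(10))
overlapping = sliding_windows(series, window_size=4, stride=1)  # 7 overlapping windows
tumbling = sliding_windows(series, window_size=5, stride=5)     # 2 windows, no overlap
```

With `stride == window_size` the windows tile the series without overlap (tumbling windows), which yields coarser, blockier anomaly scores.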
timeeval.algorithms.knn¶
- timeeval.algorithms.knn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
KNN
Implementation of https://doi.org/10.1145/342009.335437.
Algorithm Parameters:
- n_neighbors: int
  Number of neighbors to use by default for kneighbors queries. (default: 5)
- leaf_size: int
  Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. (default: 30)
- method: enum[largest,mean,median]
  'largest': use the distance to the kth neighbor as the outlier score, 'mean': use the average of all k neighbors as the outlier score, 'median': use the median of the distance to k neighbors as the outlier score. (default: largest)
- radius: float
  Range of parameter space to use by default for radius_neighbors queries. (default: 1.0)
- distance_metric_order: int
  Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances. (default: 2)
- n_jobs: int
  The number of parallel jobs to run for the neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only the kneighbors and kneighbors_graph methods. (default: 1)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the KNN algorithm.
- Return type
  Algorithm
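The three `method` variants reduce the k nearest-neighbor distances to a single outlier score in different ways. A minimal one-dimensional sketch of that reduction (an illustration only; the containerized KNN detector works on windowed, multivariate data):

```python
import statistics

def knn_outlier_score(point, train, n_neighbors=5, method="largest"):
    """Outlier score derived from the distances to the k nearest training points."""
    dists = sorted(abs(point - x) for x in train)[:n_neighbors]
    if method == "largest":
        return dists[-1]                   # distance to the k-th neighbor
    if method == "mean":
        return statistics.fmean(dists)
    return statistics.median(dists)        # method == "median"

train = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0]
inlier = knn_outlier_score(0.25, train, n_neighbors=3, method="largest")
outlier = knn_outlier_score(5.0, train, n_neighbors=3, method="largest")
```

'mean' and 'median' are more robust to a single unusually near or far neighbor than 'largest'.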
timeeval.algorithms.laser_dbn¶
- timeeval.algorithms.laser_dbn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
LaserDBN
Implementation of https://doi.org/10.1007/978-3-662-53806-7_3
Algorithm Parameters:
- timesteps: int
  Number of time steps the DBN builds probabilities for (min: 2) (default: 2)
- n_bins: int
  Number of bins used for discretization. (default: 10)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the LaserDBN algorithm.
- Return type
  Algorithm
timeeval.algorithms.left_stampi¶
- timeeval.algorithms.left_stampi(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Left STAMPi
Implementation of https://www.cs.ucr.edu/~eamonn/PID4481997_extend_Matrix%20Profile_I.pdf
Algorithm Parameters:
- anomaly_window_size: int
  Size of the sliding windows (default: 50)
- n_init_train: int
  Fraction of data used to warm up the streaming computation. (default: 100)
- random_state: int
  Seed for the random number generator (default: 42)
- use_column_index: int
  The column index to use as input for the univariate algorithm for multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column ('timestamp'). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the Left STAMPi algorithm.
- Return type
  Algorithm
timeeval.algorithms.lof¶
- timeeval.algorithms.lof(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
LOF
Implementation of https://doi.org/10.1145/342009.335388.
Algorithm Parameters:
- n_neighbors: int
  Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used. (default: 20)
- leaf_size: int
  Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. (default: 30)
- distance_metric_order: int
  Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances. (default: 2)
- n_jobs: int
  The number of parallel jobs to run for the neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only the kneighbors and kneighbors_graph methods. (default: 1)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the LOF algorithm.
- Return type
  Algorithm
timeeval.algorithms.lstm_ad¶
- timeeval.algorithms.lstm_ad(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
LSTM-AD
Implementation of https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2015-56.pdf
Algorithm Parameters:
- lstm_layers: int
  Number of stacked LSTM layers (default: 2)
- split: float
  Train-validation split for early stopping (default: 0.9)
- window_size: int
  (default: 30)
- prediction_window_size: int
  Number of points predicted (default: 1)
- batch_size: int
  Number of instances trained at the same time (default: 32)
- validation_batch_size: int
  Number of instances used for validation at the same time (default: 128)
- epochs: int
  Number of training iterations over the entire dataset (default: 50)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- learning_rate: float
  Learning rate for the Adam optimizer (default: 0.001)
- random_state: int
  Seed for the random number generator (default: 42)
- test_batch_size: int
  Number of instances used for testing at the same time (default: 128)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the LSTM-AD algorithm.
- Return type
  Algorithm
timeeval.algorithms.lstm_vae¶
- timeeval.algorithms.lstm_vae(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
LSTM-VAE
Self-implementation of https://ieeexplore.ieee.org/document/8279425
Algorithm Parameters:
- rnn_hidden_size: int
  Hidden dimension of the LSTM cells (default: 5)
- latent_size: int
  Dimension of the latent space (default: 5)
- learning_rate: float
  Rate at which the gradients are updated (default: 0.001)
- batch_size: int
  Size of the batch given for each iteration (default: 32)
- epochs: int
  Number of iterations the model is trained for (default: 10)
- window_size: int
  Number of data points that the model takes in at once (default: 10)
- lstm_layers: int
  Number of layers in the LSTM (default: 10)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the LSTM-VAE algorithm.
- Return type
  Algorithm
timeeval.algorithms.median_method¶
- timeeval.algorithms.median_method(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
MedianMethod
Implementation of https://doi.org/10.1007/s10115-006-0026-6
Algorithm Parameters:
- neighbourhood_size: int
  Specifies the number of time steps to look forward and backward for each data point. (default: 100)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the MedianMethod algorithm.
- Return type
  Algorithm
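The idea behind `neighbourhood_size` can be sketched in a few lines: score each point by how far it deviates from the median of its surrounding neighbourhood. This is a simplified, hypothetical rendering of the approach, not the containerized implementation:

```python
import statistics

def median_method_scores(series, neighbourhood_size=2):
    """Score each point by its absolute deviation from the median of the
    window spanning neighbourhood_size points on each side."""
    scores = []
    for t, v in enumerate(series):
        lo = max(0, t - neighbourhood_size)
        hi = min(len(series), t + neighbourhood_size + 1)
        scores.append(abs(v - statistics.median(series[lo:hi])))
    return scores

series = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
scores = median_method_scores(series, neighbourhood_size=2)
```

Because the median is robust, a single spike barely shifts the neighbourhood statistic, so the spike itself receives a high score while its neighbors stay near zero.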
timeeval.algorithms.mscred¶
- timeeval.algorithms.mscred(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
MSCRED
Implementation of https://doi.org/10.1609/aaai.v33i01.33011409
Algorithm Parameters:
- windows: List[int]
  Number and size of different signature matrices (correlation matrices) to compute as a preprocessing step (default: [10, 30, 60])
- gap_time: int
  Number of points to skip over between the generation of signature matrices (default: 10)
- window_size: int
  Size of the sliding windows (default: 5)
- batch_size: int
  Number of instances trained at the same time (default: 32)
- learning_rate: float
  Learning rate for the Adam optimizer (default: 0.001)
- epochs: int
  Number of training iterations over the entire dataset (default: 1)
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- split: float
  Train-validation split for early stopping (default: 0.8)
- test_batch_size: int
  Number of instances used for validation and testing at the same time (default: 256)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the MSCRED algorithm.
- Return type
  Algorithm
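The signature matrices that the `windows` parameter controls capture pairwise channel correlations over a window. A strongly simplified, hypothetical sketch of one such matrix (the paper normalizes by the window length; this ignores further details of the MSCRED pipeline):

```python
def signature_matrix(window_rows):
    """Pairwise inner products of channel segments, normalized by the
    window length (a simplified signature/correlation matrix)."""
    w = len(window_rows[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / w for rj in window_rows]
            for ri in window_rows]

# Two channels observed over a window of 4 points (rows = channels).
window = [[1.0, 2.0, 3.0, 4.0],
          [1.0, 0.0, -1.0, 0.0]]
sig = signature_matrix(window)
```

Computing such matrices at several window lengths (e.g. the default [10, 30, 60]) gives the model views of inter-channel behavior at multiple time scales.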
timeeval.algorithms.mstamp¶
- timeeval.algorithms.mstamp(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
mSTAMP
Implementation of http://www.cs.ucr.edu/%7Eeamonn/Motif_Discovery_ICDM.pdf
Algorithm Parameters:
- anomaly_window_size: int
  Size of the sliding windows (default: 50)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the mSTAMP algorithm.
- Return type
  Algorithm
timeeval.algorithms.mtad_gat¶
- timeeval.algorithms.mtad_gat(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
MTAD-GAT
Implementation of http://arxiv.org/abs/2009.02040
Algorithm Parameters:
- mag_window_size: int
  Window size for the sliding window average calculation (default: 3)
- score_window_size: int
  Window size for anomaly scoring (default: 40)
- threshold: float
  Threshold for SR cleaning (default: 3)
- context_window_size: int
  Window for the mean in SR cleaning (default: 5)
- kernel_size: int
  Kernel size for the 1D convolution (default: 7)
- learning_rate: float
  Learning rate for training (default: 0.001)
- epochs: int
  Number of times the algorithm trains on the dataset (default: 1)
- batch_size: int
  Number of data points propagated in parallel (default: 64)
- window_size: int
  Window size for windowing the time series (default: 20)
- gamma: float
  Importance factor for the posterior in scoring (default: 0.8)
- latent_size: int
  Embedding size in the VAE (default: 300)
- linear_layer_shape: List[int]
  Architecture of the FC-NN (default: [300, 300, 300])
- early_stopping_patience: int
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- early_stopping_delta: float
  If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- split: float
  Train-validation split for early stopping (default: 0.8)
- random_state: int
  Seed for the random number generator (default: 42)
- Parameters
  - params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
  - skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
  - timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
  A correctly configured Algorithm object for the MTAD-GAT algorithm.
- Return type
  Algorithm
timeeval.algorithms.multi_hmm¶
- timeeval.algorithms.multi_hmm(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
MultiHMM
Implementation of https://doi.org/10.1016/j.asoc.2017.06.035
Algorithm Parameters:
- discretizer: enum[sugeno,choquet,fcm]
Available discretizers are “sugeno”, “choquet”, and “fcm”. If the time series has only one feature, the K-Bins discretizer is used. (default: fcm)
- n_bins: int
Number of bins used for discretization. (default: 10)
- random_state: int
Seed for random number generation. (default: 42)
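MultiHMM discretizes each channel into n_bins symbols before fitting its hidden Markov models. As an intuition for what such a binning step produces, here is a simple equal-width variant (the actual discretizers listed above, sugeno, choquet, and fcm, are fuzzy-measure based and more involved; this helper is purely illustrative):

```python
def kbins_discretize(values, n_bins=10):
    """Equal-width binning of a single channel into the
    symbols 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant channels
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Values spread over [0, 1] map onto 4 symbols:
assert kbins_discretize([0.0, 0.25, 0.5, 0.75, 1.0], n_bins=4) == [0, 1, 2, 3, 3]
```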
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the MultiHMM algorithm.
- Return type
Algorithm
timeeval.algorithms.multi_norma¶
- timeeval.algorithms.multi_norma(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
MultiNormA
Improved algorithm based on NorM (https://doi.org/10.1109/ICDE48307.2020.00182).
Warning
The implementation of this algorithm is not publicly available (closed source). Thus, TimeEval will fail to download the Docker image and the algorithm will not be available. Please contact the authors of the algorithm for the implementation and build the algorithm Docker image yourself.
Algorithm Parameters:
- anomaly_window_size: int
Sliding window size used to create subsequences (equal to desired anomaly length) (default: 20)
- normal_model_percentage: float
Percentage of (random) subsequences used to build the normal model. (default: 0.5)
- max_motifs: int
Maximum number of used motifs. Important to avoid OOM errors. (default: 4096)
- random_state: int
Seed for random number generation. (default: 42)
- motif_detection: Enum[stomp,random,mixed]
Algorithm to use for motif detection [random, stomp, mixed]. (default: mixed)
- sum_dims: boolean
Sum all dimensions up before computing distances; otherwise, each dimension is handled separately. (default: False)
- normalize_join: boolean
Apply join normalization heuristic. [false = no normalization, true = normalize] (default: True)
- join_combine_method: int
How to combine the join values from all dimensions. [0=sum, 1=max, 2=score dims (based on std, mean, range), 3=weight higher vals, 4=vals**channels] (default: 1)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the MultiNormA algorithm.
- Return type
Algorithm
timeeval.algorithms.multi_subsequence_lof¶
- timeeval.algorithms.multi_subsequence_lof(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
Multi-Sub-LOF
LOF on sliding windows of multivariate time series to detect subsequence anomalies.
Algorithm Parameters:
- window_size: int
Size of the sliding windows to extract subsequences as input to LOF. (default: 100)
- n_neighbors: int
Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used. (default: 20)
- leaf_size: int
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. (default: 30)
- distance_metric_order: int
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances. (default: 2)
- dim_aggregation_method: enum[concat,sum]
Method used to aggregate multiple dimensions, so that LOF can process the subsequence. When ‘concat’, the 2D-matrix is flattened into a 1D-vector; when ‘sum’, the 2D-matrix is aggregated over the channels to get a 1D sliding window. (default: concat)
- n_jobs: int
The number of parallel jobs to run for neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only kneighbors and kneighbors_graph methods. (default: 1)
- random_state: int
Seed for random number generation. (default: 42)
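The two dim_aggregation_method options can be made concrete with a small sketch. Each sliding window over the multivariate series is a 2D block (window_size rows, one per time step, each holding all channel values); ‘concat’ flattens it, ‘sum’ collapses the channels. The helper name is hypothetical, not the TimeEval API:

```python
def extract_windows(series, window_size, method="concat"):
    """`series` is a list of rows, one list of channel values per
    time step. Each window of `window_size` rows is either flattened
    ('concat') or summed over the channels ('sum') to produce one
    input vector for LOF."""
    windows = []
    for i in range(len(series) - window_size + 1):
        window = series[i:i + window_size]
        if method == "concat":
            windows.append([v for row in window for v in row])
        else:  # "sum"
            windows.append([sum(row) for row in window])
    return windows

ts = [[1, 10], [2, 20], [3, 30], [4, 40]]  # 4 time steps, 2 channels
assert extract_windows(ts, 2, "concat") == [[1, 10, 2, 20], [2, 20, 3, 30], [3, 30, 4, 40]]
assert extract_windows(ts, 2, "sum") == [[11, 22], [22, 33], [33, 44]]
```

Note that ‘concat’ preserves per-channel information at the cost of a window_size * n_channels input dimension, while ‘sum’ keeps the input at window_size values.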
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Multi-Sub-LOF algorithm.
- Return type
Algorithm
timeeval.algorithms.mvalmod¶
- timeeval.algorithms.mvalmod(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
mVALMOD
Implementation of https://doi.org/10.1007/s10618-020-00685-w summed up for every channel.
Algorithm Parameters:
- min_anomaly_window_size: Int
Minimum sliding window size (default: 30)
- max_anomaly_window_size: Int
Maximum sliding window size (default: 40)
- heap_size: Int
Size of the distance profile heap buffer (default: 50)
- exclusion_zone: Float
Size of the exclusion zone as a factor of the window_size. This prevents self-matches. (default: 0.5)
- verbose: Int
Controls logging verbosity. (default: 1)
- random_state: Int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the mVALMOD algorithm.
- Return type
Algorithm
timeeval.algorithms.norma¶
- timeeval.algorithms.norma(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
NormA
Improved algorithm based on NorM (https://doi.org/10.1109/ICDE48307.2020.00182).
Warning
The implementation of this algorithm is not publicly available (closed source). Thus, TimeEval will fail to download the Docker image and the algorithm will not be available. Please contact the authors of the algorithm for the implementation and build the algorithm Docker image yourself.
Algorithm Parameters:
- anomaly_window_size: int
Sliding window size used to create subsequences (equal to desired anomaly length) (default: 20)
- normal_model_percentage: float
Percentage of (random) subsequences used to build the normal model. (default: 0.5)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the NormA algorithm.
- Return type
Algorithm
timeeval.algorithms.normalizing_flows¶
- timeeval.algorithms.normalizing_flows(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
Normalizing Flows
Implementation of https://arxiv.org/abs/1912.09323
Algorithm Parameters:
- n_hidden_features_factor: float
Factor deciding how many hidden features are used for the NFs, based on the number of input features (default: 1.0)
- hidden_layer_shape: List[int]
NN hidden layers structure (default: [100, 100])
- window_size: int
Window size of sliding window over time series (default: 20)
- split: float
Train-validation split (default: 0.9)
- epochs: int
Number of training epochs (default: 1)
- batch_size: int
How many data instances are trained at the same time. (default: 64)
- test_batch_size: int
How many data instances are tested at the same time. (default: 128)
- teacher_epochs: int
Number of epochs for teacher NF training (default: 1)
- distillation_iterations: int
Number of training steps for distillation (default: 1)
- percentile: float
Percentile defining the tails for anomaly sampling. (default: 0.05)
- early_stopping_patience: int
If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- early_stopping_delta: float
If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- random_state: int
Seed for the random number generator (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Normalizing Flows algorithm.
- Return type
Algorithm
timeeval.algorithms.novelty_svr¶
- timeeval.algorithms.novelty_svr(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
NoveltySVR
Implementation of https://doi.org/10.1145/956750.956828.
Algorithm Parameters:
- n_init_train: int
Number of initial points to fit the regression model on. For those points, no score is calculated. (default: 500)
- forgetting_time: int
If this is set, points older than forgetting_time are removed from the model (forgotten) (paper: W) (default: None)
- train_window_size: int
Size of training windows, also called embedding dimension, used as context to predict the next point (paper: D) (default: 16)
- anomaly_window_size: int
Size of event windows, also called event duration, for which surprising occurrences are aggregated. Should not be chosen too large! (paper: n) (default: 6)
- lower_suprise_bound: int
Number of surprising occurrences that must be present within an event (see window_size) to regard the event as novel/anomalous (paper: h). Range: 0 < lower_suprise_bound < window_size. If not supplied, ‘h = window_size / 2’ is used as default. (default: None)
- scaling: enum[null,standard,robust,power]
If the data should be scaled/normalized before regression using StandardScaler, RobustScaler, or PowerTransformer (Yeo-Johnson + standard scaling). See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py. (default: standard)
- epsilon: float
Specifies the epsilon-tube to find surprising occurrences in the prediction residuals (resid !> 2eps). Reused as Online SVR parameter: Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value. (default: 0.1)
- verbose: int
Controls verbose output. Higher values mean more detailed output [0; 5]. Verbose output of the Online SVR does not appear until >= 3. (default: 0)
- C: float
Online SVR parameter: Penalty parameter C of the error term. (default: 1.0)
- kernel: enum[linear,poly,rbf,sigmoid,rbf-gaussian,rbf-exp]
Online SVR parameter: Specifies the kernel type to be used in the algorithm. (default: rbf)
- degree: int
Online SVR parameter: Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. (default: 3)
- gamma: float
Online SVR parameter: Kernel coefficient for ‘poly’, ‘sigmoid’, and ‘rbf’ kernels. If gamma is None, then 1/n_features will be used instead. (default: None)
- coef0: float
Online SVR parameter: Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’. (default: 0.0)
- tol: float
Online SVR parameter: Tolerance for stopping criterion. (default: 0.001)
- stabilized: boolean
Online SVR parameter: If stabilization should be used. (default: true)
- random_state: int
Seed for random number generation. (default: 42)
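The interplay of epsilon, anomaly_window_size, and lower_suprise_bound can be sketched in a few lines: a prediction residual outside the 2*epsilon tube counts as a surprising occurrence, and an event window containing at least h of them is regarded as anomalous. The helper names are hypothetical and the logic is a simplification of the paper's definitions:

```python
def count_surprising(residuals, epsilon=0.1):
    """A residual is 'surprising' when it leaves the 2*epsilon tube."""
    return sum(1 for r in residuals if abs(r) > 2 * epsilon)

def is_anomalous_event(residuals, epsilon=0.1, h=3):
    """Flag an event window that holds at least `h` surprising
    occurrences (paper: h; default would be window_size / 2)."""
    return count_surprising(residuals, epsilon) >= h

# An event window (anomaly_window_size = 6) of prediction residuals:
window = [0.05, 0.3, -0.25, 0.01, 0.5, -0.4]
assert count_surprising(window, epsilon=0.1) == 4
assert is_anomalous_event(window, epsilon=0.1, h=3)
```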
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the NoveltySVR algorithm.
- Return type
Algorithm
timeeval.algorithms.numenta_htm¶
- timeeval.algorithms.numenta_htm(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
NumentaHTM
Implementation of https://doi.org/10.1016/j.neucom.2017.04.070
Algorithm Parameters:
- encoding_input_width: int
(default: 21)
- encoding_output_width: int
(default: 50)
- autoDetectWaitRecords: int
(default: 50)
- columnCount: int
Number of cell columns in the cortical region (same number for SP and TM) (default: 2048)
- numActiveColumnsPerInhArea: int
Maximum number of active columns in the SP region’s output (when there are more, the weaker ones are suppressed) (default: 40)
- potentialPct: float
What percent of the column’s receptive field is available for potential synapses. At initialization time, we will choose potentialPct * (2*potentialRadius+1)^2 (default: 0.5)
- synPermConnected: float
The default connected threshold. Any synapse whose permanence value is above the connected threshold is a “connected synapse”, meaning it can contribute to the cell’s firing. Typical value is 0.10. Cells whose activity level before inhibition falls below minDutyCycleBeforeInh will have their own internal synPermConnectedCell threshold set below this default value. (default: 0.1)
- synPermActiveInc: float
(default: 0.1)
- synPermInactiveDec: float
(default: 0.005)
- cellsPerColumn: int
The number of cells (i.e., states) allocated per column. (default: 32)
- inputWidth: int
(default: 2048)
- newSynapseCount: int
New synapse formation count (default: 20)
- maxSynapsesPerSegment: int
Maximum number of synapses per segment (default: 32)
- maxSegmentsPerCell: int
Maximum number of segments per cell (default: 128)
- initialPerm: float
Initial permanence (default: 0.21)
- permanenceInc: float
Permanence increment (default: 0.1)
- permanenceDec: float
Permanence decrement (default: 0.1)
- globalDecay: float
(default: 0.0)
- maxAge: int
(default: 0)
- minThreshold: int
Minimum number of active synapses for a segment to be considered during search for the best-matching segments. (default: 9)
- activationThreshold: int
Segment activation threshold. A segment is active if it has >= tpSegmentActivationThreshold connected synapses that are active due to infActiveState. (default: 12)
- pamLength: int
“Pay Attention Mode” length. This tells the TM how many new elements to append to the end of a learned sequence at a time. Smaller values are better for datasets with short sequences, higher values are better for datasets with long sequences. (default: 1)
- alpha: float
This controls how fast the classifier learns/forgets. Higher values make it adapt faster and forget older patterns faster. (default: 0.5)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the NumentaHTM algorithm.
- Return type
Algorithm
timeeval.algorithms.ocean_wnn¶
- timeeval.algorithms.ocean_wnn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
OceanWNN
Implementation of https://doi.org/10.1016/j.oceaneng.2019.106129
Algorithm Parameters:
- train_window_size: int
Window size used for forecasting the next point (default: 20)
- hidden_size: int
Number of neurons in hidden layer (default: 20)
- batch_size: int
Number of instances trained at the same time (default: 64)
- test_batch_size: int
Batch size over test and validation dataset (default: 256)
- epochs: int
Number of training iterations over entire dataset; recommended value: 1000 (default: 1)
- split: float
Train-validation split for early stopping (default: 0.8)
- early_stopping_delta: float
If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 0.05)
- early_stopping_patience: int
If 1 - (loss / last_loss) is less than delta for patience epochs, stop (default: 10)
- learning_rate: float
Learning rate for Adam optimizer (default: 0.01)
- wavelet_a: float
WBF scale parameter; recommended range: [-2.5, 2.5] (default: -2.5)
- wavelet_k: float
WBF shift parameter; recommended range: [-1.5, 1.5] (default: -1.5)
- wavelet_wbf: enum[mexican_hat,central_symmetric,morlet]
Mother WBF; allowed values: “mexican_hat”, “central_symmetric”, “morlet” (default: mexican_hat)
- wavelet_cs_C: float
Cosine factor for central-symmetric WBF. (default: 1.75)
- threshold_percentile: float
Upper percentile of training residual distribution used for detection replacement. (default: 0.99)
- random_state: int
Seed for the random number generator (default: 42)
- with_threshold: boolean
If true, values whose forecasting error exceeds the threshold are not included in the next window but are replaced by the prediction. (default: true)
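For intuition, the default Mexican-hat mother wavelet basis function (WBF) with scale wavelet_a and shift wavelet_k might be sketched as follows. Note that the (t - k) / a parametrization and the standard Mexican-hat form are assumptions for illustration, not taken from the referenced paper:

```python
import math

def mexican_hat(t, a=-2.5, k=-1.5):
    """Mexican-hat wavelet (1 - x^2) * exp(-x^2 / 2) evaluated at
    x = (t - k) / a, with scale `a` (wavelet_a) and shift `k`
    (wavelet_k). The parametrization is an illustrative assumption."""
    x = (t - k) / a
    return (1 - x * x) * math.exp(-x * x / 2)

# At t = k the argument is 0 and the wavelet peaks at 1:
assert mexican_hat(-1.5, a=-2.5, k=-1.5) == 1.0
```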
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the OceanWNN algorithm.
- Return type
Algorithm
timeeval.algorithms.omnianomaly¶
- timeeval.algorithms.omnianomaly(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
OmniAnomaly
Implementation of https://doi.org/10.1145/3292500.3330672
Algorithm Parameters:
- latent_size: int
Reduced dimension size (default: 3)
- rnn_hidden_size: int
Size of RNN hidden layer (default: 500)
- window_size: int
Sliding window size (default: 100)
- linear_hidden_size: int
Dense layer size (default: 500)
- nf_layers: int
NF layer size (default: 20)
- epochs: int
Number of training passes over entire dataset (default: 10)
- split: float
Train-validation split (default: 0.8)
- batch_size: int
Number of data points fitted in parallel (default: 50)
- l2_reg: float
Regularization factor (default: 0.0001)
- learning_rate: float
Learning rate for Adam optimizer (default: 0.001)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the OmniAnomaly algorithm.
- Return type
Algorithm
timeeval.algorithms.pcc¶
- timeeval.algorithms.pcc(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
PCC
Implementation of http://citeseerx.ist.psu.edu/viewdoc/summary;jsessionid=003008C2CF2373B9C332D4A1DB035515?doi=10.1.1.66.299.
Algorithm Parameters:
- n_components: int
Number of components to keep. If n_components is not set, all components are kept: n_components == min(n_samples, n_features). (default: None)
- n_selected_components: int
Number of selected principal components for calculating the outlier scores. It is not necessarily equal to the total number of the principal components. If not set, use all principal components. (default: None)
- whiten: boolean
When True, the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions. (default: false)
- svd_solver: enum[auto,full,arpack,randomized]
‘auto’: the solver is selected by a default policy based on X.shape and n_components. If the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards. ‘full’: run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing. ‘arpack’: run SVD truncated to n_components calling the ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < X.shape[1]. ‘randomized’: run randomized SVD by the method of Halko et al. (default: auto)
- tol: float
Tolerance for singular values computed by svd_solver == ‘arpack’. (default: 0.0)
- max_iter: int
Number of iterations for the power method computed by svd_solver == ‘randomized’. (default: None)
- random_state: int
Used when svd_solver == ‘arpack’ or svd_solver == ‘randomized’ to seed random number generation. (default: 42)
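The core idea behind PCC is to score each point by how poorly the selected principal components reconstruct it. A self-contained sketch of that idea with a single component, using pure-Python power iteration in place of the LAPACK/ARPACK solvers listed above (helper names are illustrative, not the TimeEval or sklearn API):

```python
def principal_component(X, iters=200):
    """Leading principal component of mean-centered data via power
    iteration on C^T C, where C is the centered data matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    C = [[r - m for r, m in zip(row, means)] for row in X]
    v = [1.0] * d
    for _ in range(iters):
        Cv = [sum(c * x for c, x in zip(row, v)) for row in C]      # C v
        w = [sum(C[i][j] * Cv[i] for i in range(n)) for j in range(d)]  # C^T C v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v, means

def outlier_scores(X):
    """Score each point by its squared residual distance from the
    leading component, i.e. the variance PCA fails to explain for it."""
    v, means = principal_component(X)
    scores = []
    for row in X:
        c = [r - m for r, m in zip(row, means)]
        proj = sum(ci * vi for ci, vi in zip(c, v))
        scores.append(sum(ci * ci for ci in c) - proj * proj)
    return scores

# Points roughly on the line y = x, plus one off-line outlier:
X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [2, -2]]
scores = outlier_scores(X)
assert max(range(len(X)), key=lambda i: scores[i]) == 5
```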
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the PCC algorithm.
- Return type
Algorithm
timeeval.algorithms.pci¶
- timeeval.algorithms.pci(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
PCI
Implementation of https://doi.org/10.1155/2014/879736
Algorithm Parameters:
- window_size: int
The algorithm uses windows around the current point to predict that point (k points before and k after, where k = window_size // 2). The difference between the real and the predicted value is used as the anomaly score. The parameter window_size acts as a kind of smoothing factor: the bigger the window_size, the smoother the predictions, but the more values have large errors. If window_size is too small, anomalies might not be found. window_size should correlate with anomaly window sizes. (default: 20)
- thresholding_p: float
This parameter is only needed if the algorithm should decide itself whether a point is an anomaly. It treats p as a confidence coefficient (the t-statistic’s confidence coefficient). The smaller p is, the bigger the confidence interval. If p is too small, anomalies might not be found. If p is too big, too many points might be labeled anomalous. (default: 0.05)
- random_state: int
Seed for random number generation. (default: 42)
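The windowed-prediction idea can be sketched in a few lines, using a plain neighbourhood mean as a stand-in for the paper's weighted prediction (the helper name is hypothetical):

```python
def pci_scores(series, window_size=20):
    """Predict each point as the mean of its k = window_size // 2
    neighbours on each side and use the absolute prediction error
    as the anomaly score (mean prediction is a simplification)."""
    k = window_size // 2
    scores = []
    for i, value in enumerate(series):
        neighbours = series[max(0, i - k):i] + series[i + 1:i + 1 + k]
        predicted = sum(neighbours) / len(neighbours)
        scores.append(abs(value - predicted))
    return scores

series = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
scores = pci_scores(series, window_size=4)
# The spike at index 3 receives the largest score:
assert max(range(len(series)), key=lambda i: scores[i]) == 3
```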
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the PCI algorithm.
- Return type
Algorithm
timeeval.algorithms.phasespace_svm¶
- timeeval.algorithms.phasespace_svm(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
PhaseSpace-SVM
Implementation of https://doi.org/10.1109/IJCNN.2003.1223670.
Algorithm Parameters:
- embed_dim_range: List[int]
List of phase space dimensions (sliding window sizes). For each dimension, an OC-SVM is fitted to calculate outlier scores. The final result is the point-wise aggregation of the anomaly scores. (default: [50, 100, 150])
- project_phasespace: boolean
Whether to use phase space projection or just work on the phase space values. (default: False)
- nu: float
Main parameter of the OC-SVM. An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Should be in the interval (0, 1]. (default: 0.5)
- kernel: enum[linear,poly,rbf,sigmoid]
Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, or ‘sigmoid’. (default: rbf)
- gamma: float
Kernel coefficient for ‘rbf’, ‘poly’, and ‘sigmoid’. If gamma is not set (null), then 1 / (n_features * X.var()) is used as the value of gamma. (default: None)
- degree: int
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. (default: 3)
- coef0: float
Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’. (default: 0.0)
- tol: float
Tolerance for stopping criterion. (default: 0.001)
- random_state: int
Seed for random number generation. (default: 42)
- use_column_index: int
The column index to use as input for the univariate algorithm for multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column (‘timestamp’). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the PhaseSpace-SVM algorithm.
- Return type
Algorithm
timeeval.algorithms.pst¶
- timeeval.algorithms.pst(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
PST
Implementation of a modified version (with preceding discretization) of https://doi.org/10.1137/1.9781611972764.9.
Algorithm Parameters:
- window_size: int
Length of the subsequences into which the time series should be split (sliding window). (default: 5)
- max_depth: int
Maximal depth of the PST. Defaults to the maximum length of the sequence(s) in the object minus 1. (default: 4)
- n_min: int
Minimum number of occurrences of a string to add it to the tree. (default: 1)
- y_min: float
Smoothing parameter for conditional probabilities, assuring that no symbol, and hence no sequence, is predicted to have a null probability. The parameter y_min sets a lower bound for a symbol’s probability. (default: None)
- n_bins: int
Number of bags (bins) into which the time series should be split by frequency. (default: 5)
- sim: enum[SIMo,SIMn]
The similarity measure to use when computing the similarity between a sequence and the PST. SIMn is supposed to yield better results. (default: SIMn)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the PST algorithm.
- Return type
Algorithm
timeeval.algorithms.random_black_forest¶
- timeeval.algorithms.random_black_forest(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
Random Black Forest (RR)
An ensemble of multiple multi-output random forest regressors based on different feature subsets (requested by RollsRoyce). The forecasting error is used as anomaly score.
Algorithm Parameters:
- train_window_size: int
Size of the training windows. Always predicts a single point! (default: 50)
- n_estimators: int
The number of forests. Each forest is trained on max_features features. (default: 2)
- max_features_per_estimator: float
Each forest is trained on randomly selected int(max_features * n_features) features. (default: 0.5)
- n_trees: int
The number of trees in the forest. (default: 100)
- max_features_method: enum[auto,sqrt,log2]
The number of features to consider when looking for the best split: ‘auto’: max_features=n_features, ‘sqrt’: max_features=sqrt(n_features), ‘log2’: max_features=log2(n_features). (default: auto)
- bootstrap: boolean
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. (default: True)
- max_samples: float
If bootstrap is True, the number of samples to draw from X to train each base estimator. (default: None)
- random_state: int
Seeds the randomness of the bootstrapping and the sampling of the features. (default: 42)
- verbose: int
Controls logging verbosity. (default: 0)
- n_jobs: int
The number of jobs to run in parallel. -1 means using all processors. (default: 1)
- max_depth: int
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. (default: None)
- min_samples_split: int
The minimum number of samples required to split an internal node. (default: 2)
- min_samples_leaf: int
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. (default: 1)
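The feature-subset setup behind the ensemble can be sketched as follows: each of the n_estimators forests sees only int(max_features * n_features) randomly drawn features of the input. The helper is hypothetical and only illustrates the subset drawing, not the forest training itself:

```python
import random

def draw_feature_subsets(n_features, n_estimators=2, max_features=0.5, seed=42):
    """Draw one random feature subset of size
    int(max_features * n_features) per estimator."""
    rng = random.Random(seed)
    subset_size = int(max_features * n_features)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_estimators)]

subsets = draw_feature_subsets(n_features=8, n_estimators=2, max_features=0.5)
assert all(len(s) == 4 for s in subsets)          # 0.5 * 8 features each
assert all(0 <= f < 8 for s in subsets for f in s)
```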
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Random Black Forest (RR) algorithm.
- Return type
Algorithm
timeeval.algorithms.robust_pca¶
- timeeval.algorithms.robust_pca(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) → Algorithm ¶
RobustPCA
Implementation of https://arxiv.org/pdf/1801.01571.pdf
Algorithm Parameters:
- max_iter: int
Defines the maximum number of robust PCA iterations for solving the matrix decomposition. (default: 1000)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (
Optional[ParameterConfig]
) – Parameter configuration for the algorithmskip_pull (
bool
) – Set toTrue
to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.timeout (
Optional[Duration]
) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set usingResourceConstraints
.
- Returns
A correctly configured
Algorithm
object for the RobustPCA algorithm.- Return type
timeeval.algorithms.s_h_esd¶
- timeeval.algorithms.s_h_esd(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
S-H-ESD (Twitter)
Implementation of http://citeseerx.ist.psu.edu/viewdoc/summary;jsessionid=003008C2CF2373B9C332D4A1DB035515?doi=10.1.1.66.299
Algorithm Parameters:
- max_anomalies: float
Expected maximum relative frequency of anomalies in the dataset. (default: 0.05)
- timestamp_unit: enum[m,h,d]
If the index column (‘timestamp’) is of type integer, this gives the unit for date conversion. A unit less than seconds is not supported by S-H-ESD! (default: m)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the S-H-ESD (Twitter) algorithm.
- Return type
Algorithm
timeeval.algorithms.sand¶
- timeeval.algorithms.sand(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
SAND
Implementation of SAND described in http://www.vldb.org/pvldb/vol14/p1717-boniol.pdf.
Warning
The implementation of this algorithm is not publicly available (closed source). Thus, TimeEval will fail to download the Docker image and the algorithm will not be available. Please contact the authors of the algorithm for the implementation and build the algorithm Docker image yourself.
Algorithm Parameters:
- anomaly_window_size: int
Size of the anomalous pattern; sliding windows for clustering and preprocessing are of size 3*anomaly_window_size. (default: 75)
- n_clusters: int
Number of clusters used in Kshape that are maintained iteratively as a normal model. (default: 6)
- n_init_train: int
Number of points to build the initial model (may contain anomalies). (default: 2000)
- iter_batch_size: int
Number of points for each batch. Mostly impacts performance (should not be too small). (default: 500)
- alpha: float
Weight decay / forgetting factor. Quite robust. (default: 0.5)
- random_state: int
Seed for random number generation. (default: 42)
- use_column_index: int
The column index to use as input for the univariate algorithm on multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column (‘timestamp’). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the SAND algorithm.
- Return type
Algorithm
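Several univariate detectors in this module (SAND, STOMP, Subsequence LOF) expose the same use_column_index parameter for multivariate inputs. A stdlib sketch of the indexing convention it documents; `select_channel` and the tuple row layout are illustrative, not part of the TimeEval API:

```python
def select_channel(rows, use_column_index):
    """Pick a single channel from a multivariate series.
    Index 0 is the first value column AFTER the 'timestamp' column."""
    return [row[1 + use_column_index] for row in rows]

# Rows laid out as (timestamp, channel 0, channel 1).
rows = [(0, 1.0, 5.0), (1, 2.0, 6.0), (2, 3.0, 7.0)]
ch1 = select_channel(rows, use_column_index=1)
print(ch1)  # [5.0, 6.0, 7.0]
```

For a univariate dataset there is only one value column, so index 0 selects it.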
timeeval.algorithms.sarima¶
- timeeval.algorithms.sarima(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
SARIMA
Implementation of SARIMA method described in https://milets18.github.io/papers/milets18_paper_19.pdf.
Algorithm Parameters:
- train_window_size: int
Number of points from the beginning of the series to build the model on. (default: 500)
- prediction_window_size: int
Number of points to forecast in one go; smaller = slower, but more accurate. (default: 10)
- max_lag: int
Refit the SARIMA model after this number of points (only helpful if fixed_orders=None). (default: None)
- period: int
Periodicity (number of periods in a season); often 4 for quarterly data or 12 for monthly data. Default is no seasonal effect (==1). Must be >= 1. (default: 1)
- max_iter: int
The maximum number of function evaluations; smaller = faster, but might not converge. (default: 20)
- exhaustive_search: boolean
Performs a full grid search to find the optimal SARIMA model without considering statistical tests on the data. Slow, but finds the optimal model. (default: false)
- n_jobs: int
The number of parallel jobs to run for grid search. If -1, then the number of jobs is set to the number of CPU cores. (default: 1)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the SARIMA algorithm.
- Return type
Algorithm
timeeval.algorithms.series2graph¶
- timeeval.algorithms.series2graph(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Series2Graph
Implementation of https://doi.org/10.14778/3407790.3407792.
Warning
The implementation of this algorithm is not publicly available (closed source). Thus, TimeEval will fail to download the Docker image and the algorithm will not be available. Please contact the authors of the algorithm for the implementation and build the algorithm Docker image yourself.
Algorithm Parameters:
- window_size: Int
Size of the sliding window (paper: l), independent of anomaly length, but should in the best case be larger. (default: 50)
- query_window_size: Int
Size of the sliding windows used to find anomalies (query subsequences). query_window_size must be >= window_size! (paper: l_q) (default: 75)
- rate: Int
Number of angles used to extract pattern nodes. A higher value will lead to high precision, but at the cost of increased computation time. (paper: r, performance parameter) (default: 30)
- random_state: Int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Series2Graph algorithm.
- Return type
Algorithm
timeeval.algorithms.sr¶
- timeeval.algorithms.sr(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Spectral Residual (SR)
Implementation of https://doi.org/10.1145/3292500.3330680
Algorithm Parameters:
- mag_window_size: int
Window size for sliding window average calculation. (default: 3)
- score_window_size: int
Window size for anomaly scoring. (default: 40)
- window_size: int
Sliding window size. (default: 50)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Spectral Residual (SR) algorithm.
- Return type
Algorithm
timeeval.algorithms.sr_cnn¶
- timeeval.algorithms.sr_cnn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
SR-CNN
Implementation of https://doi.org/10.1145/3292500.3330680
Algorithm Parameters:
- window_size: int
Sliding window size. (default: 128)
- random_state: int
Seed for random number generators. (default: 42)
- step: int
Stride size for training data generation. (default: 64)
- num: int
Max value for generated data. (default: 10)
- learning_rate: float
Gradient factor during SGD training. (default: 1e-06)
- epochs: int
Number of training passes over the entire dataset. (default: 1)
- batch_size: int
Number of data points trained in parallel. (default: 256)
- n_jobs: int
Number of processes used during training. (default: 1)
- split: float
Train-validation split for early stopping. (default: 0.9)
- early_stopping_delta: float
If 1 - (loss / last_loss) is less than delta for patience epochs, stop. (default: 0.05)
- early_stopping_patience: int
If 1 - (loss / last_loss) is less than delta for patience epochs, stop. (default: 10)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the SR-CNN algorithm.
- Return type
Algorithm
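SR-CNN, TAnoGan, and Telemanom all document the same early-stopping rule: stop when the relative improvement 1 - (loss / last_loss) stays below delta for patience consecutive epochs. A stdlib sketch of that rule; the `early_stopper` helper is illustrative, not part of TimeEval:

```python
def early_stopper(delta, patience):
    """Signal stop when relative improvement 1 - loss/last_loss stays
    below `delta` for `patience` consecutive epochs."""
    state = {"last": None, "bad": 0}

    def step(loss):
        if state["last"] is not None:
            improvement = 1.0 - loss / state["last"]
            state["bad"] = state["bad"] + 1 if improvement < delta else 0
        state["last"] = loss
        return state["bad"] >= patience

    return step

stop = early_stopper(delta=0.05, patience=3)  # SR-CNN's documented defaults
losses = [10.0, 9.0, 8.9, 8.85, 8.84, 8.83]
flags = [stop(l) for l in losses]
print(flags)  # training would stop at the first True
```

The first large improvement (10.0 to 9.0) resets the counter; the following small improvements accumulate until patience is exhausted.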
timeeval.algorithms.ssa¶
- timeeval.algorithms.ssa(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
SSA
Segmented Sequence Analysis calculates two piecewise linear models, aligns them, and then computes the similarity between them. Finally, a threshold-based approach is used to classify data as anomalous.
Warning
The implementation of this algorithm is not publicly available (closed source). Thus, TimeEval will fail to download the Docker image and the algorithm will not be available. Please contact the authors of the algorithm for the implementation and build the algorithm Docker image yourself.
Algorithm Parameters:
- ep: int
Score normalization value. (default: 3)
- window_size: int
Size of the sliding window. (default: 20)
- rf_method: Enum[all,alpha]
all: directly calculate the reference time series from all points. alpha: create a weighted reference time series with help of the parameter ‘alpha’. (default: alpha)
- alpha: float
Describes weights that are used for reference time series creation. Can be a single weight (float) or an array of weights. So far only a single value is supported. (default: 0.2)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the SSA algorithm.
- Return type
Algorithm
timeeval.algorithms.stamp¶
- timeeval.algorithms.stamp(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
STAMP
Implementation of https://doi.org/10.1109/ICDM.2016.0179.
Algorithm Parameters:
- anomaly_window_size: Int
Size of the sliding window. (default: 30)
- exclusion_zone: Float
Size of the exclusion zone as a factor of the window_size. This prevents self-matches. (default: 0.5)
- verbose: Int
Controls logging verbosity. (default: 1)
- n_jobs: Int
The number of jobs to run in parallel. -1 is not supported, defaults back to serial implementation. (default: 1)
- random_state: Int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the STAMP algorithm.
- Return type
Algorithm
timeeval.algorithms.stomp¶
- timeeval.algorithms.stomp(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
STOMP
Implementation of https://doi.org/10.1109/ICDM.2016.0085.
Algorithm Parameters:
- anomaly_window_size: Int
Size of the sliding window. (default: 30)
- exclusion_zone: Float
Size of the exclusion zone as a factor of the window_size. This prevents self-matches. (default: 0.5)
- verbose: Int
Controls logging verbosity. (default: 1)
- n_jobs: Int
The number of jobs to run in parallel. -1 is not supported, defaults back to serial implementation. (default: 1)
- random_state: Int
Seed for random number generation. (default: 42)
- use_column_index: int
The column index to use as input for the univariate algorithm on multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column (‘timestamp’). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the STOMP algorithm.
- Return type
Algorithm
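STAMP and STOMP both compute a matrix profile: for every subsequence, the distance to its nearest neighbor, where exclusion_zone * window_size positions around each index are skipped to prevent trivial self-matches. A naive O(n²) stdlib sketch of that idea; the real algorithms are far more efficient, and `matrix_profile_naive` is illustrative only:

```python
import math

def znorm(w):
    """Z-normalize a window so matching is offset- and scale-invariant."""
    m = sum(w) / len(w)
    sd = math.sqrt(sum((x - m) ** 2 for x in w) / len(w)) or 1.0
    return [(x - m) / sd for x in w]

def matrix_profile_naive(ts, window, exclusion=0.5):
    """For each subsequence: z-normalized Euclidean distance to its nearest
    neighbor outside the exclusion zone. High values mark discords."""
    n = len(ts) - window + 1
    subs = [znorm(ts[i:i + window]) for i in range(n)]
    zone = max(1, int(window * exclusion))
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) < zone:  # skip trivial self-matches
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            best = min(best, d)
        profile.append(best)
    return profile

ts = [math.sin(x / 3.0) for x in range(60)]
ts[30] += 3.0  # inject one anomalous spike
prof = matrix_profile_naive(ts, window=10)
# The windows covering the spike have no close match and score highest.
```

Without the exclusion zone, every window would match itself (or its immediate neighbors) at distance ~0 and the profile would be uninformative.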
timeeval.algorithms.subsequence_fast_mcd¶
- timeeval.algorithms.subsequence_fast_mcd(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Subsequence Fast-MCD
Implementation of https://doi.org/10.2307/1270566 with sliding windows as input
Algorithm Parameters:
- store_precision: boolean
Specify if the estimated precision is stored. (default: True)
- support_fraction: float
The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: (n_sample + n_features + 1) / 2. The parameter must be in the range (0, 1). (default: None)
- random_state: int
Determines the pseudo random number generator for shuffling the data. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Subsequence Fast-MCD algorithm.
- Return type
Algorithm
timeeval.algorithms.subsequence_if¶
- timeeval.algorithms.subsequence_if(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Subsequence IF
Isolation Forest on sliding windows to detect subsequence anomalies.
Algorithm Parameters:
- window_size: int
Size of the sliding windows used to extract subsequences as input to the isolation forest. (default: 100)
- n_trees: int
The number of decision trees (base estimators) in the forest (ensemble). (default: 100)
- max_samples: float
The number of samples to draw from X to train each base estimator: max_samples * X.shape[0]. If unspecified (null), then max_samples=min(256, n_samples). (default: None)
- max_features: float
The number of features to draw from X to train each base estimator: max_features * X.shape[1]. (default: 1.0)
- bootstrap: boolean
If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed. (default: false)
- random_state: int
Seed for random number generation. (default: 42)
- verbose: int
Controls the verbosity of the tree building process logs. (default: 0)
- n_jobs: int
The number of jobs to run in parallel. If -1, then the number of jobs is set to the number of cores. (default: 1)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Subsequence IF algorithm.
- Return type
Algorithm
timeeval.algorithms.subsequence_knn¶
- timeeval.algorithms.subsequence_knn(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Sub-KNN
KNN on sliding windows to detect subsequence anomalies.
Algorithm Parameters:
- window_size: int
Size of the sliding windows used to extract subsequences as input to KNN. (default: 100)
- n_neighbors: int
Number of neighbors to use by default for kneighbors queries. (default: 5)
- leaf_size: int
Leaf size passed to BallTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. (default: 30)
- method: enum[largest,mean,median]
‘largest’: use the distance to the kth neighbor as the outlier score; ‘mean’: use the average of all k neighbors as the outlier score; ‘median’: use the median of the distances to the k neighbors as the outlier score. (default: largest)
- radius: float
Range of parameter space to use by default for radius_neighbors queries. (default: 1.0)
- distance_metric_order: int
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances. (default: 2)
- n_jobs: int
The number of parallel jobs to run for the neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only the kneighbors and kneighbors_graph methods. (default: 1)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Sub-KNN algorithm.
- Return type
Algorithm
timeeval.algorithms.subsequence_lof¶
- timeeval.algorithms.subsequence_lof(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Subsequence LOF
LOF on sliding windows to detect subsequence anomalies.
Algorithm Parameters:
- window_size: int
Size of the sliding windows used to extract subsequences as input to LOF. (default: 100)
- n_neighbors: int
Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used. (default: 20)
- leaf_size: int
Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. (default: 30)
- distance_metric_order: int
Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances. (default: 2)
- n_jobs: int
The number of parallel jobs to run for the neighbors search. If -1, then the number of jobs is set to the number of CPU cores. Affects only the kneighbors and kneighbors_graph methods. (default: 1)
- random_state: int
Seed for random number generation. (default: 42)
- use_column_index: int
The column index to use as input for the univariate algorithm on multivariate datasets. The selected single channel of the multivariate time series is analyzed by the algorithm. The index is 0-based and does not include the index column (‘timestamp’). The single channel of a univariate dataset, therefore, has index 0. (default: 0)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Subsequence LOF algorithm.
- Return type
Algorithm
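The Subsequence IF/KNN/LOF wrappers share one pattern: slide a window over the series, score each window with a point-based detector, and map the window scores back to individual time points. A stdlib sketch of the windowing and of one common reverse mapping (averaging over all covering windows); the helper names are illustrative, not TimeEval API:

```python
def sliding_windows(ts, window_size):
    """Overlapping subsequences that a point-based detector can score."""
    return [ts[i:i + window_size] for i in range(len(ts) - window_size + 1)]

def window_to_point_scores(scores, window_size, n):
    """Map one score per window back to one score per time point by
    averaging the scores of all windows that cover the point."""
    point_scores = []
    for t in range(n):
        lo = max(0, t - window_size + 1)      # first window covering t
        hi = min(t, n - window_size)          # last window covering t
        covering = scores[lo:hi + 1]
        point_scores.append(sum(covering) / len(covering))
    return point_scores

ts = [0.0, 0.1, 0.2, 9.0, 0.2, 0.1]
wins = sliding_windows(ts, 3)
print(len(wins))  # 4 windows of length 3
pts = window_to_point_scores([1.0, 2.0, 3.0, 4.0], 3, len(ts))
```

Points near the series boundaries are covered by fewer windows, which is why the averaging bounds are clamped.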
timeeval.algorithms.tanogan¶
- timeeval.algorithms.tanogan(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
TAnoGan
Implementation of http://arxiv.org/abs/2008.09567
Algorithm Parameters:
- epochs: int
Number of training iterations over the entire dataset. (default: 1)
- cuda: boolean
Set to true if the GPU backend (using CUDA) should be used. Otherwise, the algorithm is executed on the CPU. (default: false)
- window_size: int
Size of the sliding windows. (default: 30)
- learning_rate: float
Learning rate for the Adam optimizer. (default: 0.0002)
- batch_size: int
Number of instances trained at the same time. (default: 32)
- n_jobs: int
Number of workers (processes) used to load and preprocess the data. (default: 1)
- random_state: int
Seed for random number generation. (default: 42)
- early_stopping_patience: int
If 1 - (loss / last_loss) is less than delta for patience epochs, stop. (default: 10)
- early_stopping_delta: float
If 1 - (loss / last_loss) is less than delta for patience epochs, stop. (default: 0.05)
- split: float
Train-validation split for early stopping. (default: 0.8)
- iterations: int
Number of test iterations per window. (default: 25)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the TAnoGan algorithm.
- Return type
Algorithm
timeeval.algorithms.tarzan¶
- timeeval.algorithms.tarzan(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
TARZAN
Implementation of https://dl.acm.org/doi/10.1145/775047.775128
Algorithm Parameters:
- random_state: int
Seed for random number generation. (default: 42)
- anomaly_window_size: int
Size of the sliding window. Equal to the discord length! (default: 20)
- alphabet_size: int
Number of symbols used for discretization by SAX (performance parameter). (default: 4)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the TARZAN algorithm.
- Return type
Algorithm
timeeval.algorithms.telemanom¶
- timeeval.algorithms.telemanom(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Telemanom
Implementation of https://doi.org/10.1145/3219819.3219845.
Algorithm Parameters:
- batch_size: Int
Number of values to evaluate in each batch. (default: 70)
- smoothing_window_size: Int
Number of trailing batches to use in error calculation. (default: 30)
- smoothing_perc: Float
Determines the window size used in EWMA smoothing (percentage of total values for the channel). (default: 0.05)
- error_buffer: Int
Number of values surrounding an error that are brought into the sequence (promotes grouping of nearby sequences). (default: 100)
- dropout: Float
LSTM dropout probability. (default: 0.3)
- lstm_batch_size: Int
Number of values to evaluate in one batch for the LSTM. (default: 64)
- epochs: Int
Number of training iterations over the entire dataset. (default: 35)
- split: Float
Train-validation split for early stopping. (default: 0.8)
- early_stopping_patience: Int
If the loss improves by delta or less for patience epochs, stop. (default: 10)
- early_stopping_delta: Float
If the loss improves by delta or less for patience epochs, stop. (default: 0.0003)
- window_size: Int
Number of previous timesteps provided to the model to predict future values. (default: 250)
- prediction_window_size: Int
Number of steps to predict ahead. (default: 10)
- p: Float
Minimum percent decrease between max errors in anomalous sequences (used for pruning). (default: 0.13)
- random_state: int
Seed for the random number generator. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Telemanom algorithm.
- Return type
Algorithm
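Telemanom smooths the raw prediction errors with an EWMA whose window is a percentage (smoothing_perc) of the series length before thresholding them. A stdlib sketch of that smoothing step; the span-to-alpha conversion used here is one common convention, not necessarily the exact one in the implementation:

```python
def ewma(errors, smoothing_perc):
    """Exponentially weighted moving average of prediction errors;
    the span is smoothing_perc (a fraction) of the series length."""
    span = max(1, int(len(errors) * smoothing_perc))
    alpha = 2.0 / (span + 1.0)  # pandas-style span -> alpha conversion
    out, s = [], errors[0]
    for e in errors:
        s = alpha * e + (1 - alpha) * s
        out.append(s)
    return out

errors = [0.1, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.1]
smoothed = ewma(errors, smoothing_perc=0.25)
# The isolated error spike is damped but still clearly visible.
```

Smoothing suppresses single-point noise in the errors so that only sustained prediction failures produce high anomaly scores.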
timeeval.algorithms.torsk¶
- timeeval.algorithms.torsk(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Torsk
Implementation of http://arxiv.org/abs/1909.01709
Algorithm Parameters:
- input_map_size: int
Size of the random weight preprocessing latent space. input_map_size must be larger than or equal to context_window_size! (default: 100)
- input_map_scale: float
Feature scaling of the random weight preprocessing. (default: 0.125)
- context_window_size: int
Size of a tumbling window used to encode the time series into a 2D (image-based) representation, called slices. (default: 10)
- train_window_size: int
Torsk creates the input subsequences by sliding a window of size train_window_size + prediction_window_size + 1 over the slices with shape (context_window_size, dim). train_window_size represents the size of the input windows for training and prediction. (default: 50)
- prediction_window_size: int
Torsk creates the input subsequences by sliding a window of size train_window_size + prediction_window_size + 1 over the slices with shape (context_window_size, dim). prediction_window_size represents the size of the ESN predictions; should be min_anomaly_length < prediction_window_size < 10 * min_anomaly_length. (default: 20)
- transient_window_size: int
Only a part of the training window, the first transient_window_size slices, is used for the ESN optimization. (default: 10)
- spectral_radius: float
ESN hyperparameter that determines the influence of the previous internal ESN state on the next one. spectral_radius > 1.0 increases non-linearity, but decreases short-term-memory capacity (maximized at 1.0). (default: 2.0)
- density: float
Density of the ESN cell, where approximately density percent of the elements are non-zero. (default: 0.01)
- reservoir_representation: enum[sparse,dense]
Representation of the ESN reservoirs. sparse is significantly faster than dense. (default: sparse)
- imed_loss: boolean
Calculate the loss on a spatially aware (image-based) data representation instead of flat arrays. (default: False)
- train_method: enum[pinv_lstsq,pinv_svd,tikhonov]
Solver used to train the ESN. tikhonov: linear solver with Tikhonov regularization; pinv_lstsq: exact least-squares solver that may lead to a numerical blowup; pinv_svd: SVD-based least-squares solver that is highly numerically stable, but approximate. (default: pinv_svd)
- tikhonov_beta: float
Parameter of the Tikhonov regularization term when train_method = tikhonov is used. (default: None)
- verbose: int
Controls the logging output. (default: 2)
- scoring_small_window_size: int
Size of the smaller of the two windows slid over the prediction errors to calculate the final anomaly scores. (default: 10)
- scoring_large_window_size: int
Size of the larger of the two windows slid over the prediction errors to calculate the final anomaly scores. (default: 100)
- random_state: int
Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Torsk algorithm.
- Return type
Algorithm
timeeval.algorithms.triple_es¶
- timeeval.algorithms.triple_es(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
Triple ES (Holt-Winter’s)
Implementation of http://www.diva-portal.org/smash/get/diva2:1198551/FULLTEXT02.pdf
Algorithm Parameters:
- train_window_size: int
size of each TripleES model to predict the next timestep (default:
200
)- period: int
number of time units at which events happen regularly/periodically (default:
100
)- trend: enum[add, mul]
type of trend component (default:
add
)- seasonal: enum[add, mul]
type of seasonal component (default:
add
)- random_state: int
Seed for random number generation. (default:
42
)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the Triple ES (Holt-Winter's) algorithm.
- Return type
Algorithm
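Triple exponential smoothing maintains level, trend, and seasonal components and can score each point by its one-step-ahead forecast error. The following additive Holt-Winters sketch illustrates the idea only; the smoothing constants (alpha, beta, gamma), the initialization, and the error-to-score mapping are illustrative assumptions, not TimeEval's actual implementation:

```python
def holt_winters_scores(series, period, alpha=0.3, beta=0.1, gamma=0.2):
    """Additive Holt-Winters; anomaly score = absolute one-step forecast error.

    Requires len(series) >= 2 * period for the initialization below.
    """
    m = period
    # initialize level, trend, and seasonal components from the first two periods
    mean1 = sum(series[:m]) / m
    mean2 = sum(series[m:2 * m]) / m
    level = mean1
    trend = (mean2 - mean1) / m
    seasonal = [series[i] - mean1 for i in range(m)]
    scores = []
    for t in range(2 * m, len(series)):
        forecast = level + trend + seasonal[t % m]
        scores.append(abs(series[t] - forecast))
        # standard additive update equations
        new_level = alpha * (series[t] - seasonal[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (series[t] - new_level) + (1 - gamma) * seasonal[t % m]
        level = new_level
    return scores
```

On a perfectly periodic series the forecast errors stay near zero, so an injected spike receives a clearly elevated score.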
timeeval.algorithms.ts_bitmap¶
- timeeval.algorithms.ts_bitmap(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
TSBitmap
Implementation of https://dl.acm.org/doi/abs/10.5555/1116877.1116907
Algorithm Parameters:
- feature_window_size: int
  Size of the tumbling windows used for SAX discretization. (default: 100)
- lead_window_size: int
  How far to look ahead to create the lead bitmap. (default: 200)
- lag_window_size: int
  How far to look back to create the lag bitmap. (default: 300)
- alphabet_size: int
  Number of bins for SAX discretization. (default: 5)
- level_size: int
  Desired level of recursion of the bitmap. (default: 3)
- compression_ratio: int
  How much to compress the time series in the PAA step. If compression_ratio == 1, no compression is applied. (default: 2)
- random_state: int
  Seed for random number generation. (default: 42)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the TSBitmap algorithm.
- Return type
Algorithm
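TSBitmap first compresses the time series with PAA (controlled by compression_ratio) and then discretizes it into alphabet_size symbols via SAX before counting subword frequencies in lead and lag bitmaps. A rough sketch of these first two steps; note that canonical SAX uses Gaussian breakpoints on a z-normalized series, whereas this simplified version uses equal-width bins:

```python
def paa(series, ratio):
    # piecewise aggregate approximation: mean of each block of `ratio` points
    return [sum(series[i:i + ratio]) / len(series[i:i + ratio])
            for i in range(0, len(series), ratio)]

def sax_symbols(series, alphabet_size):
    # simplified SAX: equal-width bins stand in for Gaussian breakpoints
    lo, hi = min(series), max(series)
    width = (hi - lo) / alphabet_size or 1.0
    return [min(int((v - lo) / width), alphabet_size - 1) for v in series]
```

With compression_ratio == 1 the PAA step degenerates to the identity, matching the parameter description above.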
timeeval.algorithms.valmod¶
- timeeval.algorithms.valmod(params: Optional[ParameterConfig] = None, skip_pull: bool = False, timeout: Optional[Duration] = None) Algorithm ¶
VALMOD
Implementation of https://doi.org/10.1007/s10618-020-00685-w.
Algorithm Parameters:
- min_anomaly_window_size: Int
  Minimum sliding window size. (default: 30)
- max_anomaly_window_size: Int
  Maximum sliding window size. (default: 40)
- heap_size: Int
  Size of the distance profile heap buffer. (default: 50)
- exclusion_zone: Float
  Size of the exclusion zone as a factor of the window_size. This prevents self-matches. (default: 0.5)
- verbose: Int
  Controls logging verbosity. (default: 1)
- random_state: Int
  Seed for random number generation. (default: 42)
)
- Parameters
params (Optional[ParameterConfig]) – Parameter configuration for the algorithm
skip_pull (bool) – Set to True to skip pulling the Docker image and use a local image instead. If the image is not present locally, this will raise an error.
timeout (Optional[Duration]) – Set an individual execution and training timeout for this algorithm. This will overwrite the global timeouts set using ResourceConstraints.
- Returns
A correctly configured Algorithm object for the VALMOD algorithm.
- Return type
Algorithm
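VALMOD builds on the matrix profile: each subsequence is compared against its nearest non-trivial match, and subsequences whose best match is still far away are discords (anomaly candidates). A naive quadratic-time sketch for a single window size, showing the role of the exclusion zone; VALMOD itself uses z-normalized distances and lower-bounding tricks to cover the whole range from min to max window size efficiently:

```python
def discord_scores(series, window_size, exclusion=0.5):
    """Score each subsequence by the distance to its nearest non-trivial match."""
    n = len(series) - window_size + 1
    excl = int(window_size * exclusion)
    scores = []
    for i in range(n):
        a = series[i:i + window_size]
        best = float("inf")
        for j in range(n):
            if abs(i - j) <= excl:
                continue  # skip trivial self-matches inside the exclusion zone
            b = series[j:j + window_size]
            d = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
            best = min(best, d)
        scores.append(best)
    return scores
```

Subsequences that repeat elsewhere get a score near zero; the unique subsequence covering an outlier gets the largest score.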
timeeval.datasets package¶
timeeval.datasets.analyzer¶
- class timeeval.datasets.analyzer.DatasetAnalyzer(dataset_id: Tuple[str, str], is_train: bool, df: Optional[DataFrame] = None, dataset_path: Optional[Path] = None, dmgr: Optional[Datasets] = None, ignore_stationarity: bool = False, ignore_trend: bool = False)¶
Utility class to analyze a dataset and infer metadata about the dataset.
Use this class to compute necessary metadata from a time series. The computation is started directly when instantiating this class. You can access the results using the property metadata. There are multiple ways to instantiate this class, but you always have to specify the dataset ID, because it is part of the metadata:
- Use an existing pandas data frame object: supply a value to the parameter df.
- Use a path to a time series: supply a value to the parameter dataset_path.
- Use a dataset ID and a reference to the dataset manager: supply a value to the parameter dmgr.
This class computes simple metadata, such as number of anomalies, mean, and standard deviation, as well as advanced metadata, such as trends or stationarity information for all time series channels. The simple metadata is exact. But the advanced metadata is estimated based on the observed time series data. The trend is computed by fitting linear regression models of different order to the time series. If the regression has a high enough correlation with the observed values, the trends and their confidence are recorded. The stationarity of the time series is estimated using two statistical tests, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) and the Augmented Dickey Fuller (ADF) test.
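The trend estimation described above can be illustrated with a plain least-squares fit: fit a line over the time index and record the slope together with the R² of the fit as the confidence. The analyzer also fits higher-order models; this stand-alone sketch covers only the linear case:

```python
def linear_trend(series):
    """Ordinary least squares fit y = a*t + b plus the R^2 of the fit.

    Requires at least two observations.
    """
    n = len(series)
    t = list(range(n))
    mt, my = sum(t) / n, sum(series) / n
    sxy = sum((ti - mt) * (yi - my) for ti, yi in zip(t, series))
    sxx = sum((ti - mt) ** 2 for ti in t)
    a = sxy / sxx           # slope (trend coefficient)
    b = my - a * mt         # intercept
    ss_res = sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, series))
    ss_tot = sum((yi - my) ** 2 for yi in series)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return a, r2
```

A high R² means the fitted trend explains most of the variance, so the trend would be recorded with high confidence.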
The metadata of a dataset can be stored to disk. This class provides utility functions to create a JSON-file per dataset, containing the metadata about the test time series and the optional training time series.
- Parameters
dataset_id (tuple of str, str) – ID of the dataset, consisting of collection and dataset name.
is_train (bool) – Whether the analyzed time series is the training or the testing time series of the dataset.
df (data frame, optional) – Time series data frame. If df is supplied, you can omit dataset_path and dmgr.
dataset_path (path, optional) – Path to the time series. If dataset_path is supplied, you can omit df and dmgr.
dmgr (Datasets object, optional) – Dataset manager instance that is used to load the time series if df and dataset_path are not specified.
ignore_stationarity (bool, optional) – Don't estimate the stationarity of the time series' channels. This might be necessary for large datasets, because this step takes a lot of time.
ignore_trend (bool, optional) – Don't estimate the trend type of the time series' channels. This might be necessary for large datasets, because this step takes a lot of time.
- static load_from_json(filename: Union[str, Path], train: bool = False) DatasetMetadata ¶
Loads existing time series metadata from disk.
If there are multiple metadata entries with the same dataset ID and training/testing-label, the first entry is used.
- Parameters
filename (path) – Path to the JSON file containing the dataset metadata. Can be written using timeeval.datasets.analyzer.DatasetAnalyzer.save_to_json().
train (bool) – Whether the training or the testing time series' metadata should be loaded from the file.
- Returns
metadata – Metadata of the training or testing time series.
- Return type
time series metadata object
- property metadata: DatasetMetadata¶
Returns the computed metadata about the time series.
- save_to_json(filename: Union[str, Path], overwrite: bool = False) None ¶
Save the computed metadata for a dataset to disk.
This method writes a dataset's metadata to a JSON-formatted file on disk. The file contains a list of metadata specifications: one specification for the test time series and potentially another one for the train time series. Since the DatasetAnalyzer analyzes just a single time series at a time, this method appends the current metadata to the existing list by default. If you want to overwrite the existing content of the file, use the parameter overwrite.
- Parameters
filename (path) – Path to the file where the metadata should be written to. Might already exist.
overwrite (bool) – Whether existing data in the file should be overwritten or the current metadata should just be appended to it.
timeeval.datasets.custom¶
- class timeeval.datasets.custom.CDEntry(test_path, train_path, details)¶
Bases:
NamedTuple
- class timeeval.datasets.custom.CustomDatasets(dataset_config: Union[str, Path])¶
Bases:
CustomDatasetsBase
Implementation of the custom datasets API.
Internal API! You should not need to use or modify this class.
This class behaves similarly to the timeeval.datasets.datasets.Datasets API while using a different internal representation for the dataset index.
- select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]] ¶
timeeval.datasets.custom_base¶
- class timeeval.datasets.custom_base.CustomDatasetsBase¶
Bases:
ABC
API definition for custom datasets.
Internal API! You should not need to use or modify this class.
- abstract select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]] ¶
timeeval.datasets.custom_noop¶
- class timeeval.datasets.custom_noop.NoOpCustomDatasets¶
Bases:
CustomDatasetsBase
Dummy implementation of the CustomDatasets interface.
Internal API! You should not need to use or modify this class.
This dummy implementation does nothing and improves readability of the timeeval.datasets.datasets.Datasets implementation by removing the need for None-checks.
- select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]] ¶
timeeval.datasets.dataset¶
- class timeeval.datasets.dataset.Dataset(datasetId: Tuple[str, str], dataset_type: str, training_type: TrainingType, length: int, dimensions: int, contamination: float, min_anomaly_length: int, median_anomaly_length: int, max_anomaly_length: int, period_size: Optional[int] = None, num_anomalies: Optional[int] = None)¶
Bases:
object
Dataset information containing basic metadata about the dataset.
This class is used within TimeEval heuristics to determine the heuristic values based on the dataset properties.
- property input_dimensionality: InputDimensionality¶
- training_type: TrainingType¶
timeeval.datasets.dataset_manager¶
- class timeeval.datasets.dataset_manager.DatasetManager(data_folder: Union[str, Path], custom_datasets_file: Optional[Union[str, Path]] = None, create_if_missing: bool = True)¶
Bases: ContextManager[DatasetManager], Datasets
Manages benchmark datasets and their meta-information.
Manages dataset collections and their meta-information that are stored in a single folder with an index file. You can also use this class to create a new TimeEval dataset collection.
Warning
ATTENTION: Not multi-processing-safe! There is no check for changes to the underlying datasets.csv file while this class is loaded.
Read-only access is fine with multiple processes.
- Parameters
data_folder (path) – Path to the folder where the benchmark data is stored. This folder contains the file datasets.csv and the datasets in a hierarchical storage layout.
custom_datasets_file (path) – Path to a file listing additional custom datasets.
create_if_missing (bool) – Create an index file in the data_folder if none could be found. Set this to False to raise an exception if the folder is wrong or does not exist.
- Raises
FileNotFoundError – If create_if_missing is set to False and no datasets.csv file was found in the data_folder.
See also
timeeval.datasets.datasets.Datasets
,timeeval.datasets.multi_dataset_manager.MultiDatasetManager
- add_dataset(dataset: DatasetRecord) None ¶
Adds a new dataset to the benchmark dataset collection (in-memory).
The provided dataset metadata is added to this dataset collection (to the in-memory index). You can save the in-memory index to disk using the timeeval.datasets.DatasetManager.save() method. The referenced time series files (training and testing paths) are not touched. If the same dataset ID (collection_name, dataset_name) as that of an existing dataset is specified, its entries are overwritten!
- Parameters
dataset (DatasetRecord object) – The dataset information to add to the benchmark collection.
- add_datasets(datasets: List[DatasetRecord]) None ¶
Add a list of datasets to the dataset collection.
Add a list of new datasets to the benchmark dataset collection (in-memory). Already existing keys are overwritten!
- Parameters
datasets (list of DatasetRecord objects) – List of dataset metadata to add to this dataset collection.
- df() DataFrame ¶
Returns a copy of the internal dataset metadata collection.
The DataFrame has the following schema:
- Index:
dataset_name, collection_name
- Columns:
train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size
- Returns
df – All custom and benchmark datasets and their metadata.
- Return type
data frame
- get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) Dataset ¶
Returns dataset metadata.
Examples
>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")
Access using the dataset ID:
>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)
Access using collection and dataset name:
>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)
- get_collection_names() List[str] ¶
Returns the unique dataset collection names (includes custom datasets if present).
- get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) DataFrame ¶
Loads the training/testing time series as a data frame.
- Parameters
- Returns
df – The training or testing time series as a pandas.DataFrame.
- Return type
data frame
- get_dataset_names() List[str] ¶
Returns the unique dataset names (includes custom datasets if present).
- get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) ndarray ¶
Loads the training/testing time series as a multi-dimensional array.
- get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) Path ¶
Returns the path to the training/testing time series of the dataset.
- get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) DatasetMetadata ¶
Computes detailed metadata about the training or testing time series of a dataset.
For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:
- Information about the training time series, if train=True is specified.
- Mean, variance, trend, and stationarity information for each channel of the time series individually.
- Parameters
- Returns
metadata – Detailed metadata about the training or testing time series.
- Return type
dataset metadata object
See also
timeeval.datasets.DatasetAnalyzer
Utility class used for the extraction of metadata.
timeeval.datasets.DatasetMetadata
Data class of the returned result.
- get_training_type(dataset_id: Tuple[str, str]) TrainingType ¶
Returns the training type of a specific dataset.
- Parameters
dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
- Returns
training_type – Either unsupervised, semi-supervised, or supervised.
- Return type
TrainingType enum
See also
timeeval.TrainingType
Enumeration of training types that could be returned by this method.
- load_custom_datasets(file_path: Union[str, Path]) None ¶
Reads a configuration file that contains additional datasets and adds them to the current dataset index.
You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file uses the JSON format and supports this structure:
{
  "dataset_name": {
    "test_path": "./path/to/test.csv",
    "train_path": "./path/to/train.csv",
    "type": "synthetic",
    "period": 10
  }
}
The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.
Warning
Repeated calls to this method overwrite the existing custom dataset list.
- Parameters
file_path (path) – Path to the custom dataset configuration file.
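A configuration file in the structure shown above can be generated and sanity-checked with the standard library before handing it to load_custom_datasets. The helper name is hypothetical; the check for the required test_path key follows the documented structure, where only train_path, type, and period are optional:

```python
import json
from pathlib import Path

def write_custom_config(path, datasets):
    """Write a custom-dataset configuration file after a minimal validity check."""
    for name, entry in datasets.items():
        # test_path is the only mandatory property per dataset entry
        if "test_path" not in entry:
            raise ValueError(f"dataset {name!r} is missing the required 'test_path'")
    Path(path).write_text(json.dumps(datasets, indent=2))
```

The resulting file can then be passed to the constructor argument custom_datasets_file or to load_custom_datasets.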
- save() None ¶
Saves the in-memory dataset index to disk.
Persists newly added benchmark datasets from memory to the benchmark dataset collection file datasets.csv. Custom datasets are excluded from persistence and cannot be saved to disk; use add_dataset() or add_datasets() to add datasets to the benchmark dataset collection.
- select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]] ¶
Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.
- Parameters
collection (str) – restrict datasets to a specific collection
dataset (str) – restrict datasets to a specific name
dataset_type (str) – restrict dataset type (e.g. real or synthetic)
datetime_index (bool) – only select datasets for which a datetime index exists; if True: the “timestamp”-column has datetime values; if False: the “timestamp”-column has monotonically increasing integer values; this condition is ignored by custom datasets
training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED
train_is_normal (bool) – if True: only return datasets whose training dataset does not contain anomalies; if False: only return datasets whose training dataset contains anomalies
input_dimensionality (timeeval.InputDimensionality) – restrict datasets to an input type, either univariate or multivariate
min_anomalies (int) – restrict datasets to those with at least min_anomalies anomalous subsequences
max_anomalies (int) – restrict datasets to those with at most max_anomalies anomalous subsequences
max_contamination (float) – restrict datasets to those having a contamination smaller than or equal to max_contamination
- Returns
dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).
- Return type
List[Tuple[str,str]]
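Conceptually, select applies a conjunction of all non-None filters to the dataset index and returns the matching IDs. A simplified stand-alone sketch over plain dicts, where the field names follow the index schema shown above and the exact-equality matching is an illustrative simplification (the real method also supports range conditions such as min_anomalies):

```python
def select_datasets(index, **conditions):
    """Return (collection_name, dataset_name) pairs matching all given conditions."""
    def matches(rec):
        # a record qualifies only if every non-None condition holds
        return all(rec.get(key) == value
                   for key, value in conditions.items()
                   if value is not None)
    return [(rec["collection_name"], rec["dataset_name"])
            for rec in index if matches(rec)]
```

With no conditions given, every dataset ID is returned, mirroring the default behavior of all-None arguments.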
- class timeeval.datasets.dataset_manager.DatasetRecord(collection_name, dataset_name, train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size)¶
Bases:
NamedTuple
- count(value, /)¶
Return number of occurrences of value.
- index(value, start=0, stop=9223372036854775807, /)¶
Return first index of value.
Raises ValueError if the value is not present.
timeeval.datasets.datasets¶
- class timeeval.datasets.datasets.Datasets(df: DataFrame, custom_datasets_file: Optional[Union[str, Path]] = None)¶
Bases:
ABC
Provides read-only access to benchmark datasets and their metadata.
This is an abstract class (interface). Please use timeeval.datasets.dataset_manager.DatasetManager or timeeval.datasets.multi_dataset_manager.MultiDatasetManager instead. The constructor arguments are filled in by the respective implementation.
- Parameters
df (pandas DataFrame) – Metadata of all loaded datasets.
custom_datasets_file (pathlib.Path or str) – Path to a file listing additional custom datasets.
- df() DataFrame ¶
Returns a copy of the internal dataset metadata collection.
The DataFrame has the following schema:
- Index:
dataset_name, collection_name
- Columns:
train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size
- Returns
df – All custom and benchmark datasets and their metadata.
- Return type
data frame
- get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) Dataset ¶
Returns dataset metadata.
Examples
>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")
Access using the dataset ID:
>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)
Access using collection and dataset name:
>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)
- get_collection_names() List[str] ¶
Returns the unique dataset collection names (includes custom datasets if present).
- get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) DataFrame ¶
Loads the training/testing time series as a data frame.
- Parameters
- Returns
df – The training or testing time series as a pandas.DataFrame.
- Return type
data frame
- get_dataset_names() List[str] ¶
Returns the unique dataset names (includes custom datasets if present).
- get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) ndarray ¶
Loads the training/testing time series as a multi-dimensional array.
- get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) Path ¶
Returns the path to the training/testing time series of the dataset.
- get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) DatasetMetadata ¶
Computes detailed metadata about the training or testing time series of a dataset.
For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:
- Information about the training time series, if train=True is specified.
- Mean, variance, trend, and stationarity information for each channel of the time series individually.
- Parameters
- Returns
metadata – Detailed metadata about the training or testing time series.
- Return type
dataset metadata object
See also
timeeval.datasets.DatasetAnalyzer
Utility class used for the extraction of metadata.
timeeval.datasets.DatasetMetadata
Data class of the returned result.
- get_training_type(dataset_id: Tuple[str, str]) TrainingType ¶
Returns the training type of a specific dataset.
- Parameters
dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
- Returns
training_type – Either unsupervised, semi-supervised, or supervised.
- Return type
TrainingType enum
See also
timeeval.TrainingType
Enumeration of training types that could be returned by this method.
- load_custom_datasets(file_path: Union[str, Path]) None ¶
Reads a configuration file that contains additional datasets and adds them to the current dataset index.
You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file uses the JSON format and supports this structure:
{
  "dataset_name": {
    "test_path": "./path/to/test.csv",
    "train_path": "./path/to/train.csv",
    "type": "synthetic",
    "period": 10
  }
}
The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.
Warning
Repeated calls to this method overwrite the existing custom dataset list.
- Parameters
file_path (path) – Path to the custom dataset configuration file.
- abstract refresh(force: bool = False) None ¶
Re-read the benchmark dataset collection information from disk.
- select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]] ¶
Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.
- Parameters
collection (str) – restrict datasets to a specific collection
dataset (str) – restrict datasets to a specific name
dataset_type (str) – restrict dataset type (e.g. real or synthetic)
datetime_index (bool) – only select datasets for which a datetime index exists; if True: the “timestamp”-column has datetime values; if False: the “timestamp”-column has monotonically increasing integer values; this condition is ignored by custom datasets
training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED
train_is_normal (bool) – if True: only return datasets whose training dataset does not contain anomalies; if False: only return datasets whose training dataset contains anomalies
input_dimensionality (timeeval.InputDimensionality) – restrict datasets to an input type, either univariate or multivariate
min_anomalies (int) – restrict datasets to those with at least min_anomalies anomalous subsequences
max_anomalies (int) – restrict datasets to those with at most max_anomalies anomalous subsequences
max_contamination (float) – restrict datasets to those having a contamination smaller than or equal to max_contamination
- Returns
dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).
- Return type
List[Tuple[str,str]]
timeeval.datasets.metadata¶
- class timeeval.datasets.metadata.DatasetMetadata(dataset_id: Tuple[str, str], is_train: bool, length: int, dimensions: int, contamination: float, num_anomalies: int, anomaly_length: AnomalyLength, means: Dict[str, float], stddevs: Dict[str, float], trends: Dict[str, List[Trend]], stationarities: Dict[str, Stationarity])¶
Bases:
object
Represents the metadata of a single time series of a dataset (for each channel).
- anomaly_length: AnomalyLength¶
- static from_json(s: str) DatasetMetadata ¶
- stationarities: Dict[str, Stationarity]¶
- property stationarity: Stationarity¶
- class timeeval.datasets.metadata.DatasetMetadataEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶
Bases:
JSONEncoder
- default(o: Any) Any ¶
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
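The same default() contract can handle the enum-typed fields (e.g. Stationarity) that appear in DatasetMetadata. A minimal stand-alone example in the same spirit; the class name EnumEncoder is hypothetical and not part of TimeEval:

```python
import json
from enum import Enum

class EnumEncoder(json.JSONEncoder):
    # serialize Enum members by name; defer everything else to the base class
    def default(self, o):
        if isinstance(o, Enum):
            return o.name
        return super().default(o)
```

Passing cls=EnumEncoder to json.dumps then turns enum members into their name strings instead of raising a TypeError.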
- class timeeval.datasets.metadata.Stationarity(value)¶
Bases:
Enum
An enumeration.
- DIFFERENCE_STATIONARY = 1¶
- NOT_STATIONARY = 3¶
- STATIONARY = 0¶
- TREND_STATIONARY = 2¶
- static from_name(s: int) Stationarity ¶
- class timeeval.datasets.metadata.Trend(tpe: timeeval.datasets.metadata.TrendType, coef: float, confidence_r2: float)¶
Bases:
object
timeeval.datasets.multi_dataset_manager¶
- class timeeval.datasets.multi_dataset_manager.MultiDatasetManager(data_folders: List[Union[str, Path]], custom_datasets_file: Optional[Union[str, Path]] = None)¶
Bases:
Datasets
Provides read-only access to multiple benchmark datasets collections and their meta-information.
Manages dataset collections and their meta-information that are stored in multiple folders. The entries in all index files must be unique and are NOT allowed to overlap! This would lead to information loss!
- Parameters
data_folders (list of paths) – List of data paths that hold the datasets and the index files.
custom_datasets_file (path) – Path to a file listing additional custom datasets.
- Raises
FileNotFoundError – If the datasets.csv-file was not found in any of the data_folders.
See also
timeeval.datasets.Datasets
,timeeval.datasets.DatasetManager
- df() DataFrame ¶
Returns a copy of the internal dataset metadata collection.
The DataFrame has the following schema:
- Index:
dataset_name, collection_name
- Columns:
train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size
- Returns
df – All custom and benchmark datasets and their metadata.
- Return type
data frame
- get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) Dataset ¶
Returns dataset metadata.
Examples
>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")
Access using the dataset ID:
>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)
Access using collection and dataset name:
>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)
- get_collection_names() List[str] ¶
Returns the unique dataset collection names (includes custom datasets if present).
- get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) DataFrame ¶
Loads the training/testing time series as a data frame.
- Parameters
- Returns
df – The training or testing time series as a pandas.DataFrame.
- Return type
data frame
- get_dataset_names() List[str] ¶
Returns the unique dataset names (includes custom datasets if present).
- get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) ndarray ¶
Loads the training/testing time series as a multi-dimensional array.
- get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) Path ¶
Returns the path to the training/testing time series of the dataset.
- get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) DatasetMetadata ¶
Computes detailed metadata about the training or testing time series of a dataset.
For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:
- Information about the training time series, if train=True is specified.
- Mean, variance, trend, and stationarity information for each channel of the time series individually.
- Parameters
- Returns
metadata – Detailed metadata about the training or testing time series.
- Return type
dataset metadata object
See also
timeeval.datasets.DatasetAnalyzer
Utility class used for the extraction of metadata.
timeeval.datasets.DatasetMetadata
Data class of the returned result.
- get_training_type(dataset_id: Tuple[str, str]) TrainingType ¶
Returns the training type of a specific dataset.
- Parameters
dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
- Returns
training_type – Either unsupervised, semi-supervised, or supervised.
- Return type
TrainingType enum
See also
timeeval.TrainingType
Enumeration of training types that could be returned by this method.
- load_custom_datasets(file_path: Union[str, Path]) None ¶
Reads a configuration file that contains additional datasets and adds them to the current dataset index.
You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file uses the JSON format and supports the following structure:
{
  "dataset_name": {
    "test_path": "./path/to/test.csv",
    "train_path": "./path/to/train.csv",
    "type": "synthetic",
    "period": 10
  }
}
The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.
Warning
Repeated calls to this method overwrite the existing custom dataset list.
- Parameters
file_path (pathlib.Path or str) – Path to the custom dataset configuration file.
- select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]] ¶
Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.
- Parameters
collection (str) – restrict datasets to a specific collection
dataset (str) – restrict datasets to a specific name
dataset_type (str) – restrict dataset type (e.g. real or synthetic)
datetime_index (bool) – only select datasets for which a datetime index exists; if True: the "timestamp" column has datetime values; if False: the "timestamp" column has monotonically increasing integer values; this condition is ignored for custom datasets
training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED
train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies
input_dimensionality (timeeval.InputDimensionality) – restrict datasets to input type, either univariate or multivariate
min_anomalies (int) – restrict datasets to those with at least min_anomalies anomalous subsequences
max_anomalies (int) – restrict datasets to those with at most max_anomalies anomalous subsequences
max_contamination (float) – restrict datasets to those with a contamination smaller than or equal to max_contamination
- Returns
dataset_names – A list of dataset identifiers (tuples of collection name and dataset name).
- Return type
List[Tuple[str, str]]
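The filtering semantics of select() can be sketched without TimeEval itself: a dataset is returned only if every condition that is not None holds for it. The index entries, field names, and the reduced parameter set below are hypothetical stand-ins for the real dataset index, shown only to illustrate the AND-semantics of the conditions.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical index entries standing in for TimeEval's dataset index:
INDEX: Dict[Tuple[str, str], Dict[str, Any]] = {
    ("GutenTAG", "sinus-type-anomaly"): {"type": "synthetic", "num_anomalies": 3},
    ("Demo", "machine-temperature"): {"type": "real", "num_anomalies": 1},
}

def select(index: Dict[Tuple[str, str], Dict[str, Any]],
           collection: Optional[str] = None,
           dataset_type: Optional[str] = None,
           min_anomalies: Optional[int] = None) -> List[Tuple[str, str]]:
    """Return dataset IDs matching all given conditions (None = condition ignored)."""
    selected = []
    for (coll, name), meta in index.items():
        if collection is not None and coll != collection:
            continue
        if dataset_type is not None and meta["type"] != dataset_type:
            continue
        if min_anomalies is not None and meta["num_anomalies"] < min_anomalies:
            continue
        selected.append((coll, name))
    return selected
```

With no conditions, all dataset IDs are returned; each additional condition narrows the result further.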
timeeval.heuristics package¶
- timeeval.heuristics.TimeEvalHeuristic(signature: str) TimeEvalParameterHeuristic ¶
Factory function for TimeEval heuristics based on their string-representation.
This wrapper allows using the heuristics by name without the need for imports. It is primarily used in the timeeval.heuristics.inject_heuristic_values() function. The currently supported heuristics are documented below in this module's documentation.
- Parameters
signature (str) – String representation of the heuristic to be created. Must be of the form <heuristic_name>(<heuristic_parameters>).
- Returns
heuristic – The created heuristic object.
- Return type
TimeEvalParameterHeuristic
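The factory mechanism can be sketched as follows. The stand-in heuristic classes and the _REGISTRY dict are hypothetical illustrations, not TimeEval code; the idea is to validate the signature string and then evaluate it as a constructor call in a restricted namespace.

```python
import re

# Hypothetical stand-in heuristic classes, for illustration only:
class AnomalyLengthHeuristic:
    def __init__(self, agg_type: str = "median"):
        self.agg_type = agg_type

class ContaminationHeuristic:
    pass

_REGISTRY = {
    "AnomalyLengthHeuristic": AnomalyLengthHeuristic,
    "ContaminationHeuristic": ContaminationHeuristic,
}

def heuristic_from_signature(signature: str):
    """Create a heuristic object from '<heuristic_name>(<heuristic_parameters>)'."""
    match = re.fullmatch(r"(\w+)\((.*)\)", signature.strip())
    if match is None or match.group(1) not in _REGISTRY:
        raise ValueError(f"Not a valid heuristic signature: {signature}")
    # Evaluate the call expression in a namespace that contains only the registry:
    return eval(signature, {"__builtins__": {}}, _REGISTRY)
```

Keyword arguments in the signature string are passed straight to the heuristic's constructor.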
- timeeval.heuristics.inject_heuristic_values(params: T, algorithm: Algorithm, dataset_details: Dataset, dataset_path: Path) T ¶
This function parses the supplied parameter mapping in params and replaces all heuristic definitions with their actual values.
The heuristics are generally evaluated in the order they are defined in the parameter mapping. However, the order among multiple ParameterDependenceHeuristics is not defined. If a heuristic returns None, the corresponding parameter is removed from the parameter mapping. Heuristics can be defined by using the following syntax as the parameter value:
"heuristic:<heuristic_name>(<heuristic_parameters>)"
Heuristics can use the following information to compute their values:
- properties of the algorithm
- properties of the dataset
- the full dataset (supplied as a path to the dataset)
- the (current) parameter mapping (later-evaluated heuristics can see the changes made by earlier heuristics)
- Parameters
params (T) – The current parameter mapping whose values should be updated by the heuristics. If an immutable mapping is passed, no changes will be made.
algorithm (Algorithm) – The algorithm for which the parameter mapping is valid.
dataset_details (Dataset) – The dataset for which the parameter mapping is supposed to be used.
dataset_path (Path) – The path to the dataset.
- Returns
params – The updated parameter mapping.
- Return type
T
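The replacement logic can be sketched with plain Python. The `evaluate` callback below is a hypothetical stand-in for the heuristic parsing and evaluation described above; the sketch only illustrates the documented behavior of replacing "heuristic:" values and dropping parameters whose heuristic returns None.

```python
from typing import Any, Callable, Dict

HEURISTIC_PREFIX = "heuristic:"

def inject_heuristic_values(params: Dict[str, Any],
                            evaluate: Callable[[str], Any]) -> Dict[str, Any]:
    """Replace every 'heuristic:<signature>' value with its computed value.

    Parameters whose heuristic evaluates to None are dropped from the mapping.
    """
    updated = dict(params)  # leave the input mapping untouched
    for key, value in params.items():
        if isinstance(value, str) and value.startswith(HEURISTIC_PREFIX):
            result = evaluate(value[len(HEURISTIC_PREFIX):])
            if result is None:
                del updated[key]
            else:
                updated[key] = result
    return updated
```

Plain parameter values (anything without the "heuristic:" prefix) pass through unchanged.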
timeeval.heuristics.TimeEvalParameterHeuristic¶
- class timeeval.heuristics.TimeEvalParameterHeuristic¶
Bases:
ABC
Base class for TimeEval parameter heuristics.
Heuristics are used to calculate parameter values for algorithms based on information about the algorithm, the dataset, or other parameters. They are evaluated in the driver process when TimeEval is configured. This means that the datasets must be available on the node executing the driver process. The calculated parameter values are then injected into the algorithm configuration and the algorithm is executed on the cluster.
See also
timeeval.heuristics.inject_heuristic_values()
Function that uses the heuristics to calculate parameter values for algorithms.
- classmethod get_param_names() List[str] ¶
Get parameter names (arguments) for the heuristic.
Adapted from https://github.com/scikit-learn/scikit-learn/blob/2beed5584/sklearn/base.py.
timeeval.heuristics.AnomalyLengthHeuristic¶
- class timeeval.heuristics.AnomalyLengthHeuristic(agg_type: str = 'median')¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use the anomaly length of the dataset as parameter value. Uses ground-truth labels, and should therefore only be used for testing purposes.
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"window_size": "heuristic:AnomalyLengthHeuristic(agg_type='max')"})
- Parameters
agg_type (str) – Type of aggregation to use for calculating the anomaly length when multiple anomalies are present in the time series. Must be one of min, median, or max. (default: median)
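The underlying computation can be sketched with plain Python (a hypothetical re-implementation for illustration, not the TimeEval code): collect the lengths of all consecutive runs of 1-labels in the ground truth and aggregate them with the chosen function.

```python
from statistics import median
from typing import List

def anomaly_lengths(labels: List[int]) -> List[int]:
    """Lengths of all consecutive runs of 1s in a binary label sequence."""
    lengths, run = [], 0
    for label in labels:
        if label == 1:
            run += 1
        elif run > 0:
            lengths.append(run)
            run = 0
    if run > 0:
        lengths.append(run)
    return lengths

def anomaly_length(labels: List[int], agg_type: str = "median") -> int:
    """Aggregate the anomaly lengths with min, median, or max."""
    agg = {"min": min, "median": median, "max": max}[agg_type]
    return int(agg(anomaly_lengths(labels)))
```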
timeeval.heuristics.CleanStartSequenceSizeHeuristic¶
- class timeeval.heuristics.CleanStartSequenceSizeHeuristic(max_factor: float = 0.1)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to compute the number of time steps until the first anomaly occurs and use it as parameter value. Uses ground-truth labels. Allows specifying a maximum fraction of the entire time series length; the minimum of the computed value and this maximum is used as parameter value.
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"n_init": "heuristic:CleanStartSequenceSizeHeuristic(max_factor=0.1)"})
- Parameters
max_factor (float) – Maximum fraction of the entire time series length to use as parameter value. This limits the parameter value. (default: 0.1)
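The described computation can be sketched in a few lines (a hypothetical illustration of the documented behavior, not the TimeEval implementation): take the index of the first anomalous label and cap it at max_factor times the series length.

```python
from typing import List

def clean_start_size(labels: List[int], max_factor: float = 0.1) -> int:
    """Number of time steps before the first anomaly, capped at max_factor * length."""
    first_anomaly = labels.index(1) if 1 in labels else len(labels)
    return min(first_anomaly, int(max_factor * len(labels)))
```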
timeeval.heuristics.ContaminationHeuristic¶
- class timeeval.heuristics.ContaminationHeuristic¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use the time series’ contamination as parameter value. The contamination is defined as the fraction of anomalous points to all points in the time series.
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"fraction": "heuristic:ContaminationHeuristic()"})
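The definition above amounts to a one-line computation; the following sketch (a hypothetical illustration, not the TimeEval code) makes it explicit.

```python
from typing import List

def contamination(labels: List[int]) -> float:
    """Fraction of anomalous (1-labelled) points among all points of the series."""
    return sum(labels) / len(labels)
```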
timeeval.heuristics.DatasetIdHeuristic¶
- class timeeval.heuristics.DatasetIdHeuristic¶
Bases:
TimeEvalParameterHeuristic
Heuristic to pass the dataset ID as a parameter value.
The dataset ID is a tuple of the collection name and the dataset name, such as ("KDD-TSAD", "022_UCR_Anomaly_DISTORTEDGP711MarkerLFM5z4").
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"dataset_id": "heuristic:DatasetIdHeuristic()"})
timeeval.heuristics.DefaultExponentialFactorHeuristic¶
- class timeeval.heuristics.DefaultExponentialFactorHeuristic(exponent: int = 0, zero_fb: float = 1.0)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use the default value multiplied by a factor of 10^exponent as parameter value.
This allows easier specification of exponential parameter search spaces based on the default value. E.g. if we consider a learning rate parameter with default value 0.01, we can use this heuristic to specify a search space of [0.0001, 0.001, 0.01, 0.1, 1] by using the following parameter values:
"heuristic:DefaultExponentialFactorHeuristic(exponent=-2)"
"heuristic:DefaultExponentialFactorHeuristic(exponent=-1)"
"heuristic:DefaultExponentialFactorHeuristic()"
"heuristic:DefaultExponentialFactorHeuristic(exponent=1)"
"heuristic:DefaultExponentialFactorHeuristic(exponent=2)"
But if the default parameter value is 0.5, the search space would be [0.005, 0.05, 0.5, 5, 50].
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"window_size": "heuristic:DefaultExponentialFactorHeuristic(exponent=1, zero_fb=200)"})
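The scaling can be sketched as follows. This is a hypothetical illustration of the documented behavior; in particular, the role of zero_fb as a replacement base for a default value of exactly 0 is an assumption, since the parameter is not described above.

```python
def default_exponential_factor(default: float, exponent: int = 0,
                               zero_fb: float = 1.0) -> float:
    """Default value scaled by 10**exponent.

    zero_fb is assumed to replace a default value of exactly 0,
    which would otherwise make the whole search space collapse to 0.
    """
    base = zero_fb if default == 0 else default
    return base * 10 ** exponent
```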
timeeval.heuristics.DefaultFactorHeuristic¶
- class timeeval.heuristics.DefaultFactorHeuristic(factor: float = 1.0, zero_fb: float = 1.0)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use the default value multiplied by a factor as parameter value.
This allows easier specification of parameter search spaces based on the default value. E.g. if we consider an n_clusters parameter with default value 50, we can use this heuristic to specify a search space of [10, 25, 50, 75, 100] by using the following parameter values:
"heuristic:DefaultExponentialFactorHeuristic(factor=0.2)"
"heuristic:DefaultExponentialFactorHeuristic(factor=0.5)"
"heuristic:DefaultExponentialFactorHeuristic()"
"heuristic:DefaultExponentialFactorHeuristic(factor=1.5)"
"heuristic:DefaultExponentialFactorHeuristic(factor=2.0)"
But if the default parameter value is 100, the search space would be [20, 50, 100, 150, 200].
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"window_size": "heuristic:DefaultFactorHeuristic(factor=1, zero_fb=200)"})
timeeval.heuristics.EmbedDimRangeHeuristic¶
- class timeeval.heuristics.EmbedDimRangeHeuristic(base_factor: float = 1.0, base_fb_value: int = 50, dim_factors: Optional[List[float]] = None)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use a range of embedding dimensions as parameter value.
The base dimensionality is calculated based on the PeriodSizeHeuristic, the base factor, and the base fallback value. The base dimensionality is then multiplied by the factors specified in dim_factors to create the embedding dimension range.
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({
...     "embed_dim": "heuristic:EmbedDimRangeHeuristic(base_factor=1, base_fb_value=50, dim_factors=[0.5, 1.0, 1.5])"
... })
- Parameters
base_factor (float) – Factor to use for the base dimensionality. Directly passed on to the PeriodSizeHeuristic. (default: 1.0)
base_fb_value (int) – Fallback value to use for the base dimensionality. Directly passed on to the PeriodSizeHeuristic. (default: 50)
dim_factors (List[float]) – Factors to use for the creation of the embedding dimension range. (default: [0.5, 1.0, 1.5])
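The construction of the range can be sketched as follows. This is a hypothetical illustration under the assumption that the base dimensionality comes from the period (scaled by base_factor) and falls back to base_fb_value when no period is available; the real heuristic delegates that step to the PeriodSizeHeuristic.

```python
from typing import List, Optional

def embed_dim_range(period: Optional[int],
                    base_factor: float = 1.0,
                    base_fb_value: int = 50,
                    dim_factors: Optional[List[float]] = None) -> List[int]:
    """Base dimensionality from the period (or fallback), scaled by dim_factors."""
    if dim_factors is None:
        dim_factors = [0.5, 1.0, 1.5]
    base_dim = int(base_factor * period) if period else base_fb_value
    return [int(f * base_dim) for f in dim_factors]
```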
timeeval.heuristics.ParameterDependenceHeuristic¶
- class timeeval.heuristics.ParameterDependenceHeuristic(source_parameter: str, fn: Optional[Callable[[Any], Any]] = None, factor: Optional[float] = None)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use the value of another parameter as parameter value.
ParameterDependenceHeuristic can be used to create a parameter value that depends on another parameter. This can be done by supplying either a mapping function or a factor. If a mapping function is supplied, it is called with the value of the source parameter as the only argument. If a factor is supplied, the value of the source parameter is multiplied by the factor. You cannot supply both a mapping function and a factor! This heuristic is evaluated after all other heuristics, so you can use it to create a parameter value that depends on the values of other parameters filled in by heuristics.
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({
...     "latent_dim": "heuristic:ParameterDependenceHeuristic(source_parameter='window_size', factor=0.5)"
... })
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({
...     "latent_dims": "heuristic:ParameterDependenceHeuristic(source_parameter='window_size', fn=lambda x: [x // 2, x, x * 2])"
... })
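The fn-or-factor behavior can be sketched as follows; this is a hypothetical illustration of the documented semantics, including the exclusivity of the two arguments. Keeping the source parameter's type after multiplication is an assumption for the sketch.

```python
from typing import Any, Callable, Dict, Optional

def dependent_value(params: Dict[str, Any], source_parameter: str,
                    fn: Optional[Callable[[Any], Any]] = None,
                    factor: Optional[float] = None) -> Any:
    """Derive a parameter value from another parameter via fn or factor (not both)."""
    if fn is not None and factor is not None:
        raise ValueError("You cannot supply both a mapping function and a factor!")
    value = params[source_parameter]
    if fn is not None:
        return fn(value)
    if factor is not None:
        return type(value)(factor * value)  # keep the source parameter's type
    return value
```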
timeeval.heuristics.PeriodSizeHeuristic¶
- class timeeval.heuristics.PeriodSizeHeuristic(factor: float = 1.0, fb_anomaly_length_agg_type: Optional[str] = None, fb_value: int = 1)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to use the period size of the dataset as parameter value.
Not all datasets have a period size, so this heuristic uses the following fallbacks in order:
1. If fb_anomaly_length_agg_type is specified, the AnomalyLengthHeuristic with the specified aggregation type is used as fallback.
2. If fb_value is specified, it is directly used as fallback.
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({
...     "window_size": "heuristic:PeriodSizeHeuristic(factor=1.0, fb_anomaly_length_agg_type='median', fb_value=100)"
... })
- Parameters
factor (float) – Factor to use for the period size. (default: 1.0)
fb_anomaly_length_agg_type (str, optional) – Aggregation type to use for the AnomalyLengthHeuristic fallback. (default: None)
fb_value (int, optional) – Value to use as fallback if no period size is available. (default: 1)
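The fallback chain can be sketched as follows; this is a hypothetical illustration in which fb_anomaly_length stands in for the value the AnomalyLengthHeuristic fallback would compute.

```python
from typing import Optional

def period_size(period: Optional[int], factor: float = 1.0,
                fb_anomaly_length: Optional[int] = None, fb_value: int = 1) -> int:
    """Scaled period size with the documented fallback chain."""
    if period is not None:
        return int(factor * period)
    if fb_anomaly_length is not None:  # stands in for the AnomalyLengthHeuristic fallback
        return fb_anomaly_length
    return fb_value
```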
timeeval.heuristics.RelativeDatasetSizeHeuristic¶
- class timeeval.heuristics.RelativeDatasetSizeHeuristic(factor: float = 0.1)¶
Bases:
TimeEvalParameterHeuristic
Heuristic to set a parameter value depending on the size of the dataset (length of the time series).
Examples
>>> from timeeval.params import FixedParameters
>>> params = FixedParameters({"n_init": "heuristic:RelativeDatasetSizeHeuristic(factor=0.1)"})
- Parameters
factor (float) – Factor to multiply the dataset length with to get the parameter value. (default: 0.1)
timeeval.metrics package¶
This module contains all metrics that can be used with TimeEval. The metrics are divided into five different categories:
Classification-metrics: These metrics are defined over binary classification predictions (zeros or ones), thus they require a thresholding strategy to convert anomaly scorings to binary classification results.
AUC-metrics: All AUC-Metrics support continuous scorings, and calculate the area under a custom curve function.
Range-metrics: Range-metrics compute the quality scores for anomaly ranges (windows) instead of each point in the time series.
VUS-metrics: The metrics of this category share a custom definition of range-based recall and range-based precision [PaparrizosEtAl2022].
Other-metrics: Metrics that don't belong to any of the above categories.
All metrics inherit from the abstract base class Metric and implement the __call__ method, the supports_continuous_scorings method, and the name property. This allows them to be used within TimeEval and on their own. You can also implement your own metrics by inheriting from timeeval.metrics.Metric (see its documentation for more information).
Examples
Using the default metric list that just contains ROC_AUC:
>>> from timeeval import TimeEval, DefaultMetrics
>>> TimeEval(dataset_mgr=..., datasets=[], algorithms=[],
...          metrics=DefaultMetrics.default_list())
Using a custom selection of metrics:
>>> from timeeval import TimeEval
>>> from timeeval.metrics import RangeRocAUC, RangeRocVUS, RangePrAUC, RangePrVUS
>>> TimeEval(dataset_mgr=..., datasets=[], algorithms=[],
...          metrics=[RangeRocAUC(buffer_size=100), RangeRocVUS(max_buffer_size=100),
...                   RangePrAUC(buffer_size=100), RangePrVUS(max_buffer_size=100)])
Using the metrics without TimeEval:
>>> import numpy as np
>>> from timeeval import DefaultMetrics
>>> from timeeval.metrics import RangePrAUC, F1Score
>>> from timeeval.metrics.thresholding import PercentileThresholding
>>> rng = np.random.default_rng(42)
>>> y_true = rng.random(100) > 0.5
>>> y_score = rng.random(100)
>>> metrics = [
...     # default metrics are already parameterized objects:
...     DefaultMetrics.ROC_AUC,
...     # all metrics (in general) are classes that need to be instantiated with their parameterization:
...     RangePrAUC(buffer_size=100),
...     # classification metrics need a thresholding strategy for continuous scorings:
...     F1Score(PercentileThresholding(percentile=95))
... ]
>>> # compute the metrics
>>> for m in metrics:
...     metric_value = m(y_true, y_score)
...     print(f"{m.name} = {metric_value}")
timeeval.metrics.Metric¶
- class timeeval.metrics.Metric¶
Bases:
ABC
Base class for metric implementations that score anomaly scorings against ground-truth binary labels. Every subclass must implement name(), score(), and supports_continuous_scorings().
.Examples
You can implement a new TimeEval metric easily by inheriting from this base class. A simple metric, for example, uses a fixed threshold to get binary labels and computes the false positive rate:
>>> import numpy as np
>>> from timeeval.metrics import Metric
>>> class FPR(Metric):
...     def __init__(self, threshold: float = 0.8):
...         self._threshold = threshold
...     @property
...     def name(self) -> str:
...         return f"FPR@{self._threshold}"
...     def score(self, y_true: np.ndarray, y_score: np.ndarray) -> float:
...         y_true = y_true.astype(bool)
...         y_pred = y_score >= self._threshold
...         fp = np.sum(y_pred & ~y_true)
...         # false positive rate = false positives / all negatives
...         return fp / np.sum(~y_true)
...     def supports_continuous_scorings(self) -> bool:
...         return True
This metric can then be used in TimeEval:
>>> from timeeval import TimeEval
>>> from timeeval.metrics import DefaultMetrics
>>> timeeval = TimeEval(dataset_mgr=..., datasets=[], algorithms=[],
...                     metrics=[FPR(threshold=0.8), DefaultMetrics.ROC_AUC])
- abstract score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RocAUC¶
- class timeeval.metrics.RocAUC(plot: bool = False, plot_store: bool = False)¶
Bases:
AucMetric
Computes the area under the receiver operating characteristic curve.
- Parameters
See also
https://en.wikipedia.org/wiki/Receiver_operating_characteristic : Explanation of the ROC-curve.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.PrAUC¶
- class timeeval.metrics.PrAUC(plot: bool = False, plot_store: bool = False)¶
Bases:
AucMetric
Computes the area under the precision recall curve.
- Parameters
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangePrecisionRangeRecallAUC¶
- class timeeval.metrics.RangePrecisionRangeRecallAUC(max_samples: int = 50, r_alpha: float = 0.5, p_alpha: float = 0, cardinality: str = 'reciprocal', bias: str = 'flat', plot: bool = False, plot_store: bool = False, name: str = 'RANGE_PR_AUC')¶
Bases:
AucMetric
Computes the area under the precision recall curve when using the range-based precision and range-based recall metric introduced by Tatbul et al. at NeurIPS 2018 [TatbulEtAl2018].
- Parameters
max_samples (int) – TimeEval uses a community implementation of the range-based precision and recall metrics, which is quite slow. To prevent long runtimes caused by scorings with high precision (many thresholds), only a specific number of possible thresholds is sampled. This parameter controls the maximum number of thresholds; too low numbers degrade the metrics' quality.
r_alpha (float) – Weight of the existence reward for the range-based recall.
p_alpha (float) – Weight of the existence reward for the range-based precision. For most, if not all, cases, p_alpha should be set to 0.
cardinality ({'reciprocal', 'one', 'udf_gamma'}) – Cardinality type.
bias ({'flat', 'front', 'middle', 'back'}) – Positional bias type.
plot (bool) –
plot_store (bool) –
name (str) – Custom name for this metric (e.g. including your parameter changes).
References
- TatbulEtAl2018
Tatbul, Nesime, Tae Jun Lee, Stan Zdonik, Mejbah Alam, and Justin Gottschlich. “Precision and Recall for Time Series.” In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), 1920–30. 2018. http://papers.nips.cc/paper/7462-precision-and-recall-for-time-series.pdf.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.AveragePrecision¶
- class timeeval.metrics.AveragePrecision(**kwargs)¶
Bases:
Metric
Computes the average precision metric over all possible thresholds.
This metric is an approximation of the timeeval.metrics.PrAUC metric.
- Parameters
kwargs (dict) – Keyword arguments that get passed down to sklearn.metrics.average_precision_score()
See also
sklearn.metrics.average_precision_score
Implementation of the average precision metric.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.Precision¶
- class timeeval.metrics.Precision(thresholding_strategy: ThresholdingStrategy)¶
Bases:
ClassificationMetric
Computes the precision metric.
- Parameters
thresholding_strategy (ThresholdingStrategy) – Thresholding strategy used to transform the anomaly scorings to binary classification predictions.
See also
sklearn.metrics.precision_score
Implementation of the precision metric.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.Recall¶
- class timeeval.metrics.Recall(thresholding_strategy: ThresholdingStrategy)¶
Bases:
ClassificationMetric
Computes the recall metric.
- Parameters
thresholding_strategy (ThresholdingStrategy) – Thresholding strategy used to transform the anomaly scorings to binary classification predictions.
See also
sklearn.metrics.recall_score
Implementation of the recall metric.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.F1Score¶
- class timeeval.metrics.F1Score(thresholding_strategy: ThresholdingStrategy)¶
Bases:
ClassificationMetric
Computes the F1 metric, which is the harmonic mean of precision and recall.
- Parameters
thresholding_strategy (ThresholdingStrategy) – Thresholding strategy used to transform the anomaly scorings to binary classification predictions.
See also
sklearn.metrics.f1_score
Implementation of the F1 metric.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangePrecision¶
- class timeeval.metrics.RangePrecision(thresholding_strategy: ThresholdingStrategy = NoThresholding(), alpha: float = 0, cardinality: str = 'reciprocal', bias: str = 'flat', name: str = 'RANGE_PRECISION')¶
Bases:
Metric
Computes the range-based precision metric introduced by Tatbul et al. at NeurIPS 2018 [TatbulEtAl2018].
Range precision is the average precision of each predicted anomaly range. For each predicted continuous anomaly range, the overlap size, position, and cardinality are considered.
- Parameters
thresholding_strategy (ThresholdingStrategy) – Strategy used to find a threshold over continuous anomaly scores to get binary labels. Use timeeval.metrics.thresholding.NoThresholding for results that already contain binary labels.
alpha (float) – Weight of the existence reward. Because precision by definition emphasizes prediction quality, there is no need for an existence reward and this value should always be set to 0.
cardinality ({'reciprocal', 'one', 'udf_gamma'}) – Cardinality type.
bias ({'flat', 'front', 'middle', 'back'}) – Positional bias type.
name (str) – Custom name for this metric (e.g. including your parameter changes).
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangeRecall¶
- class timeeval.metrics.RangeRecall(thresholding_strategy: ThresholdingStrategy = NoThresholding(), alpha: float = 0, cardinality: str = 'reciprocal', bias: str = 'flat', name: str = 'RANGE_RECALL')¶
Bases:
Metric
Computes the range-based recall metric introduced by Tatbul et al. at NeurIPS 2018 [TatbulEtAl2018].
Range recall is the average recall of each real anomaly range. For each real anomaly range the overlap size, position, and cardinality with predicted anomaly ranges are considered. In addition, an existence reward can be given that boosts the recall even if just a single point of the real anomaly is in the predicted ranges.
- Parameters
thresholding_strategy (ThresholdingStrategy) – Strategy used to find a threshold over continuous anomaly scores to get binary labels. Use timeeval.metrics.thresholding.NoThresholding for results that already contain binary labels.
alpha (float) – Weight of the existence reward. If 0: no existence reward; if 1: only existence reward. The existence reward is given if the real anomaly range overlaps with even a single point of the predicted anomaly range.
cardinality ({'reciprocal', 'one', 'udf_gamma'}) – Cardinality type.
bias ({'flat', 'front', 'middle', 'back'}) – Positional bias type.
name (str) – Custom name for this metric (e.g. including your parameter changes).
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangeFScore¶
- class timeeval.metrics.RangeFScore(thresholding_strategy: ThresholdingStrategy = NoThresholding(), beta: float = 1, p_alpha: float = 0, r_alpha: float = 0.5, cardinality: str = 'reciprocal', p_bias: str = 'flat', r_bias: str = 'flat', name: Optional[str] = None)¶
Bases:
Metric
Computes the range-based F-score using the recall and precision metrics by Tatbul et al. at NeurIPS 2018 [TatbulEtAl2018].
The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. This implementation uses the range-based precision and range-based recall as basis.
- Parameters
thresholding_strategy (ThresholdingStrategy) – Strategy used to find a threshold over continuous anomaly scores to get binary labels. Use timeeval.metrics.thresholding.NoThresholding for results that already contain binary labels.
beta (float) – The beta value determines the weight of recall in the combined score: beta < 1 lends more weight to precision, while beta > 1 favors recall.
p_alpha (float) – Weight of the existence reward for the range-based precision. For most, if not all, cases, p_alpha should be set to 0.
r_alpha (float) – Weight of the existence reward for the range-based recall. If 0: no existence reward; if 1: only existence reward.
cardinality ({'reciprocal', 'one', 'udf_gamma'}) – Cardinality type.
p_bias ({'flat', 'front', 'middle', 'back'}) – Positional bias type for the range-based precision.
r_bias ({'flat', 'front', 'middle', 'back'}) – Positional bias type for the range-based recall.
name (str, optional) – Custom name for this metric (e.g. including your parameter changes). If None, the beta value is included in the name: "RANGE_F{beta}_SCORE".
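The weighted harmonic mean described above can be written out explicitly. This sketch shows only the F-beta combination of already-computed precision and recall values; the range-based computation of those two inputs is what the class itself implements.

```python
def f_beta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta = 1 this reduces to the usual F1 score; larger beta values move the score toward the recall.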
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.FScoreAtK¶
- class timeeval.metrics.FScoreAtK(k: Optional[int] = None)¶
Bases:
Metric
Computes the F-score at k based on anomaly ranges.
This metric only considers the top-k predicted anomaly ranges within the scoring by finding a threshold on the scoring that produces at least k anomaly ranges. If k is not specified, the number of anomalies within the ground truth is used as k.
- Parameters
k (int, optional) – Number of top anomaly ranges used to calculate the F-score. If k is not specified (None), the number of true anomalies (based on the ground truth values) is used.
See also
timeeval.metrics.thresholding.TopKRangesThresholding
Thresholding approach used.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.PrecisionAtK¶
- class timeeval.metrics.PrecisionAtK(k: Optional[int] = None)¶
Bases:
Metric
Computes the Precision at k based on anomaly ranges.
This metric only considers the top-k predicted anomaly ranges within the scoring by finding a threshold on the scoring that produces at least k anomaly ranges. If k is not specified, the number of anomalies within the ground truth is used as k.
- Parameters
k (int, optional) – Number of top anomalies used to calculate precision. If k is not specified (None), the number of true anomalies (based on the ground truth values) is used.
See also
timeeval.metrics.thresholding.TopKRangesThresholding
Thresholding approach used.
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangePrAUC¶
- class timeeval.metrics.RangePrAUC(buffer_size: Optional[int] = None, compatibility_mode: bool = False, max_samples: int = 250, plot: bool = False, plot_store: bool = False)¶
Bases:
RangeAucMetric
Computes the area under the precision-recall-curve using the range-based precision and range-based recall definition from Paparrizos et al. published at VLDB 2022 [PaparrizosEtAl2022].
We first extend the anomaly labels by two slopes of buffer_size//2 length on both sides of each anomaly, uniformly sample thresholds from the anomaly score, and then compute the confusion matrix for all thresholds. Using the resulting precision and recall values, we can plot a curve and compute its area.
We make some changes to the original implementation from [PaparrizosEtAl2022] because we do not agree with the original assumptions. To reproduce the original results, you can set the parameter compatibility_mode=True. This will compute exactly the same values as the code by the authors of the paper. The following things are different in TimeEval compared to the original version:
For the recall (TPR) existence reward, we count anomalies as separate events, even if the added slopes overlap.
Overlapping slopes do not sum up in their anomaly weight; we just take the maximum anomaly weight for each point in the ground truth.
The original slopes are asymmetric: The slopes at the end of anomalies are a single point shorter than the ones at the beginning of anomalies. We use symmetric slopes of the same size for the beginning and end of anomalies.
We use a linear approximation of the slopes instead of the convex slope shape presented in the paper.
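The label-extension step described above can be sketched as follows. This is an illustrative helper (`extend_labels` is not TimeEval's actual function): it adds linear slopes of length `buffer_size // 2` before and after each anomaly and keeps the point-wise maximum where slopes overlap, as stated in the list of differences.

```python
import numpy as np

def extend_labels(y_true: np.ndarray, buffer_size: int) -> np.ndarray:
    """Add linear slopes of length buffer_size // 2 before and after each
    anomaly; overlapping slopes keep the point-wise maximum weight."""
    half = buffer_size // 2
    weights = y_true.astype(float)
    # slope values strictly between 0 and 1, rising towards the anomaly
    slope = np.linspace(0.0, 1.0, half + 2)[1:-1]
    starts = np.flatnonzero(np.diff(np.concatenate(([0], y_true))) == 1)
    ends = np.flatnonzero(np.diff(np.concatenate((y_true, [0]))) == -1)
    for s, e in zip(starts, ends):
        for i, w in enumerate(slope):
            left = s - half + i           # rising slope before the anomaly
            if left >= 0:
                weights[left] = max(weights[left], w)
            right = e + half - i          # falling slope after the anomaly
            if right < len(weights):
                weights[right] = max(weights[right], w)
    return weights

print(extend_labels(np.array([0, 0, 0, 1, 0, 0, 0]), buffer_size=4))
# weights ramp up and down symmetrically: [0, 1/3, 2/3, 1, 2/3, 1/3, 0]
```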
- Parameters
buffer_size (Optional[int]) – Size of the buffer region around an anomaly. We add an increasing slope of size buffer_size//2 to the beginning of anomalies and a decreasing slope of size buffer_size//2 to the end of anomalies. Per default (when buffer_size==None), buffer_size is the median length of the anomalies within the time series. However, you can also set it to the period size of the dominant frequency or any other desired value.
compatibility_mode (bool) – When set to True, produces exactly the same output as the metric implementation by the original authors. Otherwise, TimeEval uses a slightly improved implementation that fixes some bugs and uses linear slopes.
max_samples (int) – Calculating precision and recall for many thresholds is quite slow. We, therefore, uniformly sample thresholds from the available score space. This parameter controls the maximum number of thresholds; too low numbers degrade the metrics' quality.
plot (bool) –
plot_store (bool) –
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangeRocAUC¶
- class timeeval.metrics.RangeRocAUC(buffer_size: Optional[int] = None, compatibility_mode: bool = False, max_samples: int = 250, plot: bool = False, plot_store: bool = False)¶
Bases:
RangeAucMetric
Computes the area under the receiver-operating-characteristic-curve using the range-based TPR and range-based FPR definition from Paparrizos et al. published at VLDB 2022 [PaparrizosEtAl2022].
We first extend the anomaly labels by two slopes of buffer_size//2 length on both sides of each anomaly, uniformly sample thresholds from the anomaly score, and then compute the confusion matrix for all thresholds. Using the resulting true positive rates (TPR) and false positive rates (FPR), we can plot a curve and compute its area.
We make some changes to the original implementation from [PaparrizosEtAl2022] because we do not agree with the original assumptions. To reproduce the original results, you can set the parameter compatibility_mode=True. This will compute exactly the same values as the code by the authors of the paper. The following things are different in TimeEval compared to the original version:
For the recall (TPR) existence reward, we count anomalies as separate events, even if the added slopes overlap.
Overlapping slopes do not sum up in their anomaly weight; we just take the maximum anomaly weight for each point in the ground truth.
The original slopes are asymmetric: The slopes at the end of anomalies are a single point shorter than the ones at the beginning of anomalies. We use symmetric slopes of the same size for the beginning and end of anomalies.
We use a linear approximation of the slopes instead of the convex slope shape presented in the paper.
- Parameters
buffer_size (Optional[int]) – Size of the buffer region around an anomaly. We add an increasing slope of size buffer_size//2 to the beginning of anomalies and a decreasing slope of size buffer_size//2 to the end of anomalies. Per default (when buffer_size==None), buffer_size is the median length of the anomalies within the time series. However, you can also set it to the period size of the dominant frequency or any other desired value.
compatibility_mode (bool) – When set to True, produces exactly the same output as the metric implementation by the original authors. Otherwise, TimeEval uses a slightly improved implementation that fixes some bugs and uses linear slopes.
max_samples (int) – Calculating precision and recall for many thresholds is quite slow. We, therefore, uniformly sample thresholds from the available score space. This parameter controls the maximum number of thresholds; too low numbers degrade the metrics' quality.
plot (bool) –
plot_store (bool) –
See also
https://en.wikipedia.org/wiki/Receiver_operating_characteristic : Explanation of the ROC-curve.
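The threshold-sampling mechanism described above can be sketched with plain point-wise TPR/FPR. This is only an illustration of the sampling idea; the real metric uses the range-based definitions and the label slopes:

```python
import numpy as np

def roc_auc_by_sampling(y_true: np.ndarray, y_score: np.ndarray,
                        max_samples: int = 250) -> float:
    """Uniformly sample thresholds and integrate the point-wise ROC curve
    with the trapezoidal rule (sketch; the real metric is range-based)."""
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    tpr, fpr = [], []
    # high-to-low thresholds yield a monotonically increasing FPR
    for t in np.linspace(y_score.max(), y_score.min(), max_samples):
        pred = y_score >= t
        tpr.append((pred & (y_true == 1)).sum() / pos)
        fpr.append((pred & (y_true == 0)).sum() / neg)
    tpr, fpr = np.asarray(tpr), np.asarray(fpr)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

print(roc_auc_by_sampling(np.array([0, 0, 1, 1]),
                          np.array([0.1, 0.2, 0.8, 0.9])))  # 1.0
```

A perfectly separating scoring yields an area of 1.0; `max_samples` trades accuracy of the curve against runtime, which mirrors the parameter above.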
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangePrVUS¶
- class timeeval.metrics.RangePrVUS(max_buffer_size: int = 500, compatibility_mode: bool = False, max_samples: int = 250)¶
Bases:
RangeAucMetric
Computes the volume under the precision-recall-buffer_size-surface using the range-based precision and range-based recall definition from Paparrizos et al. published at VLDB 2022 [PaparrizosEtAl2022].
For all buffer sizes from 0 to max_buffer_size, we first extend the anomaly labels by two slopes of buffer_size//2 length on both sides of each anomaly, uniformly sample thresholds from the anomaly score, and then compute the confusion matrix for all thresholds. Using the resulting precision and recall values, we can plot a curve and compute its area.
This metric includes similar changes as RangePrAUC, which can be disabled using the compatibility_mode parameter.
- Parameters
max_buffer_size (int) – Maximum size of the buffer region around an anomaly. We iterate over all buffer sizes from 0 to max_buffer_size to create the surface.
compatibility_mode (bool) – When set to True, produces exactly the same output as the metric implementation by the original authors. Otherwise, TimeEval uses a slightly improved implementation that fixes some bugs and uses linear slopes.
max_samples (int) – Calculating precision and recall for many thresholds is quite slow. We, therefore, uniformly sample thresholds from the available score space. This parameter controls the maximum number of thresholds; too low numbers degrade the metrics' quality.
See also
timeeval.metrics.RangePrAUC
Area under the curve version using a single buffer size.
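The relationship between the surface and the per-buffer-size curves can be sketched as follows. `auc_for_buffer` is a hypothetical callable (not part of TimeEval) that maps a buffer size to the AUC of the corresponding curve; the volume is then approximated by averaging:

```python
import numpy as np

def volume_under_surface(auc_for_buffer, max_buffer_size: int) -> float:
    """Sketch: the surface stacks one PR curve per buffer size from 0 to
    max_buffer_size; its volume can be approximated by averaging the
    per-buffer-size AUC values."""
    aucs = [auc_for_buffer(b) for b in range(max_buffer_size + 1)]
    return float(np.mean(aucs))

# a detector whose AUC does not depend on the buffer size yields the
# same value as its volume
print(volume_under_surface(lambda b: 0.75, max_buffer_size=500))  # 0.75
```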
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.RangeRocVUS¶
- class timeeval.metrics.RangeRocVUS(max_buffer_size: int = 500, compatibility_mode: bool = False, max_samples: int = 250)¶
Bases:
RangeAucMetric
Computes the volume under the receiver-operating-characteristic-buffer_size-surface using the range-based TPR and range-based FPR definition from Paparrizos et al. published at VLDB 2022 [PaparrizosEtAl2022].
For all buffer sizes from 0 to max_buffer_size, we first extend the anomaly labels by two slopes of buffer_size//2 length on both sides of each anomaly, uniformly sample thresholds from the anomaly score, and then compute the confusion matrix for all thresholds. Using the resulting true positive rates (TPR) and false positive rates (FPR), we can plot a curve and compute its area.
This metric includes similar changes as RangeRocAUC, which can be disabled using the compatibility_mode parameter.
- Parameters
max_buffer_size (int) – Maximum size of the buffer region around an anomaly. We iterate over all buffer sizes from 0 to max_buffer_size to create the surface.
compatibility_mode (bool) – When set to True, produces exactly the same output as the metric implementation by the original authors. Otherwise, TimeEval uses a slightly improved implementation that fixes some bugs and uses linear slopes.
max_samples (int) – Calculating precision and recall for many thresholds is quite slow. We, therefore, uniformly sample thresholds from the available score space. This parameter controls the maximum number of thresholds; too low numbers degrade the metrics' quality.
See also
- https://en.wikipedia.org/wiki/Receiver_operating_characteristic :
Explanation of the ROC-curve.
- timeeval.metrics.RangeRocAUC :
Area under the curve version using a single buffer size.
References
- [PaparrizosEtAl2022]
John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, and Michael J. Franklin. Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection. PVLDB, 15(11): 2774 - 2787, 2022. doi:10.14778/3551793.3551830
- score(y_true: ndarray, y_score: ndarray) float ¶
Implementation of the metric’s scoring function.
Please use __call__() instead of calling this function directly!
Examples
Instantiate a metric and call it using the __call__ method:
>>> import numpy as np
>>> from timeeval.metrics import RocAUC
>>> metric = RocAUC(plot=False)
>>> metric(np.array([0, 1, 1, 0]), np.array([0.1, 0.4, 0.35, 0.8]))
0.5
timeeval.metrics.DefaultMetrics¶
- class timeeval.metrics.DefaultMetrics¶
Default metrics of TimeEval that can be used directly for time series anomaly detection algorithms without further configuration.
Examples
Using the default metric list that just contains ROC_AUC:
>>> from timeeval import TimeEval, DefaultMetrics
>>> TimeEval(dataset_mgr=..., datasets=[], algorithms=[],
...          metrics=DefaultMetrics.default_list())
You can also specify multiple default metrics:
>>> from timeeval import TimeEval, DefaultMetrics
>>> TimeEval(dataset_mgr=..., datasets=[], algorithms=[],
...          metrics=[DefaultMetrics.ROC_AUC, DefaultMetrics.PR_AUC,
...                   DefaultMetrics.FIXED_RANGE_PR_AUC])
- AVERAGE_PRECISION = <timeeval.metrics.other_metrics.AveragePrecision object>¶
- FIXED_RANGE_PR_AUC = <timeeval.metrics.range_metrics.RangePrecisionRangeRecallAUC object>¶
- PR_AUC = <timeeval.metrics.auc_metrics.PrAUC object>¶
- RANGE_F1 = <timeeval.metrics.range_metrics.RangeFScore object>¶
- RANGE_PRECISION = <timeeval.metrics.range_metrics.RangePrecision object>¶
- RANGE_PR_AUC = <timeeval.metrics.range_metrics.RangePrecisionRangeRecallAUC object>¶
- RANGE_RECALL = <timeeval.metrics.range_metrics.RangeRecall object>¶
- ROC_AUC = <timeeval.metrics.auc_metrics.RocAUC object>¶
timeeval.metrics.thresholding package¶
timeeval.metrics.thresholding.ThresholdingStrategy¶
- class timeeval.metrics.thresholding.ThresholdingStrategy¶
Bases:
ABC
Takes an anomaly scoring and ground truth labels to compute and apply a threshold to the scoring.
Subclasses of this abstract base class define different strategies to put a threshold over the anomaly scorings. All strategies produce binary labels (0 or 1; 1 for anomalous) in the form of an integer NumPy array. The strategy NoThresholding is a special no-op strategy that checks for already existing binary labels and keeps them untouched. This allows applying the metrics on existing binary classification results.
- abstract find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Abstract method containing the actual code to determine the threshold. Must be overwritten by subclasses!
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
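The interface above can be illustrated with a minimal custom strategy. `MedianThresholding` is a hypothetical name and a toy rule; it mirrors the `find_threshold`/`fit`/`transform`/`fit_transform` protocol without subclassing the real ABC:

```python
import numpy as np

class MedianThresholding:
    """Hypothetical strategy following the ThresholdingStrategy protocol:
    find_threshold() computes the threshold, fit() stores it, and
    transform() applies it to produce integer binary labels."""

    def __init__(self) -> None:
        self.threshold = np.nan

    def find_threshold(self, y_true: np.ndarray, y_score: np.ndarray) -> float:
        # toy rule: use the median score as the threshold
        return float(np.nanmedian(y_score))

    def fit(self, y_true: np.ndarray, y_score: np.ndarray) -> None:
        self.threshold = self.find_threshold(y_true, y_score)

    def transform(self, y_score: np.ndarray) -> np.ndarray:
        return (y_score >= self.threshold).astype(np.int_)

    def fit_transform(self, y_true: np.ndarray, y_score: np.ndarray) -> np.ndarray:
        self.fit(y_true, y_score)
        return self.transform(y_score)

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.8, 0.9])
print(MedianThresholding().fit_transform(y_true, y_score))  # [0 0 1 1]
```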
timeeval.metrics.thresholding.NoThresholding¶
- class timeeval.metrics.thresholding.NoThresholding¶
Bases:
ThresholdingStrategy
Special no-op strategy that checks for already existing binary labels and keeps them untouched. This allows applying the metrics on existing binary classification results.
- find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Does nothing (no-op).
- Parameters
y_true (np.ndarray) – Ignored.
y_score (np.ndarray) – Ignored.
- Return type
- fit(y_true: ndarray, y_score: ndarray) None ¶
Does nothing (no-op).
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- transform(y_score: ndarray) ndarray ¶
Checks if the provided scoring y_score is actually a binary classification prediction of integer type. If this is the case, the prediction is returned. If not, a ValueError is raised.
- Parameters
y_score (np.ndarray) – Anomaly scoring with binary predictions.
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.metrics.thresholding.FixedValueThresholding¶
- class timeeval.metrics.thresholding.FixedValueThresholding(threshold: float = 0.8)¶
Bases:
ThresholdingStrategy
Thresholding approach using a fixed threshold value.
- Parameters
threshold (float) – Fixed threshold to use. All anomaly scorings are scaled to the interval [0, 1] before the threshold is applied.
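The strategy can be sketched in a few lines of numpy. `fixed_value_labels` is an illustrative helper, not the library's implementation; it assumes plain min-max scaling to [0, 1]:

```python
import numpy as np

def fixed_value_labels(y_score: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Min-max scale the scoring to [0, 1], then mark every point above
    the fixed threshold as anomalous (illustrative sketch)."""
    lo, hi = np.nanmin(y_score), np.nanmax(y_score)
    scaled = (y_score - lo) / (hi - lo)
    return (scaled > threshold).astype(np.int_)

print(fixed_value_labels(np.array([2.0, 4.0, 10.0]), threshold=0.8))  # [0 0 1]
```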
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.metrics.thresholding.PercentileThresholding¶
- class timeeval.metrics.thresholding.PercentileThresholding(percentile: int = 90)¶
Bases:
ThresholdingStrategy
Use the xth-percentile of the anomaly scoring as threshold.
- Parameters
percentile (int) – The percentile of the anomaly scoring to use. Must be between 0 and 100.
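The behavior described for find_threshold() (NaNs ignored, linear interpolation) matches numpy's `nanpercentile`, so the strategy can be sketched as:

```python
import numpy as np

def percentile_threshold(y_score: np.ndarray, percentile: int = 90) -> float:
    """Sketch of the strategy: np.nanpercentile ignores NaNs and uses
    linear interpolation, matching the documented behavior (assumed
    equivalence; not the library's exact code)."""
    return float(np.nanpercentile(y_score, percentile))

scores = np.array([0.1, 0.2, 0.9, 1.0])
t = percentile_threshold(scores, percentile=50)  # median of the scores
labels = (scores > t).astype(np.int_)
print(t, labels)  # 0.55 [0 0 1 1]
```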
- find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Computes the xth-percentile ignoring NaNs and using a linear interpolation.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
threshold – The xth-percentile of the anomaly scoring as threshold.
- Return type
float
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.metrics.thresholding.TopKPointsThresholding¶
- class timeeval.metrics.thresholding.TopKPointsThresholding(k: Optional[int] = None)¶
Bases:
ThresholdingStrategy
Calculates a threshold so that exactly k points are marked anomalous.
- Parameters
k (int, optional) – Number of expected anomalous points. If k is None, the ground truth data is used to calculate the real number of anomalous points.
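The reciprocal-ratio rule described below for find_threshold() can be sketched as follows; `top_k_points_threshold` is an illustrative helper, not the library's exact code:

```python
import numpy as np

def top_k_points_threshold(y_score: np.ndarray, k: int) -> float:
    """Use the complement of the ratio k / n as the target percentile,
    so roughly k points score above the threshold (sketch)."""
    n = len(y_score)
    return float(np.nanpercentile(y_score, 100 * (1 - k / n)))

scores = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
t = top_k_points_threshold(scores, k=1)  # 80th percentile of the scores
print((scores > t).astype(np.int_))  # [0 0 0 1 0]
```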
- find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Computes a threshold based on the number of expected anomalous points.
The threshold is determined by taking the reciprocal ratio of expected anomalous points to all points as target percentile. We, again, ignore NaNs and use a linear interpolation. If k is None, the ground truth data is used to calculate the real ratio of anomalous points to all points. Otherwise, k is used as the number of expected anomalous points.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
threshold – Threshold that yields k anomalous points.
- Return type
float
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.metrics.thresholding.TopKRangesThresholding¶
- class timeeval.metrics.thresholding.TopKRangesThresholding(k: Optional[int] = None)¶
Bases:
ThresholdingStrategy
Calculates a threshold so that exactly k anomalies are found. The anomalies are either single-point anomalies or continuous anomalous ranges.
- Parameters
k (int, optional) – Number of expected anomalies. If k is None, the ground truth data is used to calculate the real number of anomalies.
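The high-to-low threshold scan described below for find_threshold() can be sketched as follows; both helpers are illustrative, not the library's exact code:

```python
import numpy as np

def count_ranges(y_pred: np.ndarray) -> int:
    # a range starts wherever a 1 follows a 0 (or the series starts with 1)
    return int((np.diff(np.concatenate(([0], y_pred))) == 1).sum())

def top_k_ranges_threshold(y_score: np.ndarray, k: int) -> float:
    """Scan candidate thresholds from high to low and stop at the first
    one yielding at least k contiguous anomalous ranges (sketch)."""
    for t in np.sort(np.unique(y_score))[::-1]:
        if count_ranges((y_score >= t).astype(np.int_)) >= k:
            return float(t)
    return float(y_score.min())

scores = np.array([0.1, 0.9, 0.1, 0.8, 0.1])
print(top_k_ranges_threshold(scores, k=2))  # 0.8
```

At threshold 0.9 only one range exists; lowering the threshold to 0.8 splits the predictions into the two required ranges.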
- find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Computes a threshold based on the number of expected anomalous subsequences / ranges (number of anomalies).
This method iterates over all possible thresholds from high to low to find the first threshold that yields k or more continuous anomalous ranges.
If k is None, the ground truth data is used to calculate the real number of anomalies (anomalous ranges).
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
threshold – Threshold that yields k anomalies.
- Return type
float
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.metrics.thresholding.SigmaThresholding¶
- class timeeval.metrics.thresholding.SigmaThresholding(factor: float = 3.0)¶
Bases:
ThresholdingStrategy
Computes a threshold \(\theta\) based on the anomaly scoring’s mean \(\mu_s\) and the standard deviation \(\sigma_s\):
\[\theta = \mu_{s} + x \cdot \sigma_{s}\]
- Parameters
factor (float) – Multiples of the standard deviation to be added to the mean to compute the threshold (\(x\)).
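The formula translates directly to numpy; `sigma_threshold` is an illustrative helper, not the library's implementation:

```python
import numpy as np

def sigma_threshold(y_score: np.ndarray, factor: float = 3.0) -> float:
    """Direct translation of the sigma rule, ignoring NaNs:
    theta = mean + factor * std (illustrative sketch)."""
    return float(np.nanmean(y_score) + factor * np.nanstd(y_score))

scores = np.array([0.0, 0.0, 0.0, 1.0])
theta = sigma_threshold(scores, factor=1.0)
labels = (scores > theta).astype(np.int_)
print(round(theta, 3), labels)  # 0.683 [0 0 0 1]
```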
- find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Determines the mean and standard deviation of the anomaly scoring, ignoring NaNs, and computes the threshold using the equation above.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
threshold – Computed threshold based on mean and standard deviation.
- Return type
float
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.metrics.thresholding.PyThreshThresholding¶
- class timeeval.metrics.thresholding.PyThreshThresholding(pythresh_thresholder: BaseThresholder, random_state: Any = None)¶
Bases:
ThresholdingStrategy
Uses a thresholder from the PyThresh package to find a scoring threshold and to transform the continuous anomaly scoring into binary anomaly predictions.
Warning
You need to install PyThresh before you can use this thresholding strategy:
pip install pythresh>=0.2.8
Please note the additional package requirements for some available thresholders of PyThresh.
- Parameters
pythresh_thresholder (pythresh.thresholds.base.BaseThresholder) – Initiated PyThresh thresholder.
random_state (Any) – Seed used to seed the numpy random number generator used in some thresholders of PyThresh. Note that PyThresh uses the legacy global RNG (np.random) and we try to reset the global RNG after calling PyThresh. Can be left at its default value for most thresholders that don't use random numbers or provide their own way of seeding. Please consult the PyThresh documentation for details about the individual thresholders.
Deprecated since version 1.2.8: Since pythresh version 0.2.8, thresholders provide a way to set their RNG state correctly, so the parameter random_state is not needed anymore. Please use the pythresh thresholder's parameter to seed it. This function's parameter is kept for compatibility with pythresh<0.2.8.
Examples
from timeeval.metrics.thresholding import PyThreshThresholding
from pythresh.thresholds.regr import REGR
import numpy as np

thresholding = PyThreshThresholding(
    REGR(method="theil")
)

y_scores = np.random.default_rng().random(1000)
y_labels = np.zeros(1000)
y_pred = thresholding.fit_transform(y_labels, y_scores)
- find_threshold(y_true: ndarray, y_score: ndarray) float ¶
Uses the passed thresholder from the PyThresh package to determine the threshold. Beforehand, the scores are forced to be finite by replacing NaNs with 0 and (negative) infinities with 1.
PyThresh thresholders directly compute the binary predictions. Thus, we cache the predictions in the member _predictions and return them when calling transform().
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
threshold – Threshold computed by the internal thresholder.
- Return type
float
- fit(y_true: ndarray, y_score: ndarray) None ¶
Calls find_threshold() to compute and set the threshold.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- fit_transform(y_true: ndarray, y_score: ndarray) ndarray ¶
Determines the threshold and applies it to the scoring in one go.
- Parameters
y_true (np.ndarray) – Ground truth binary labels.
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
- transform(y_score: ndarray) ndarray ¶
Applies the threshold to the anomaly scoring and returns the corresponding binary labels.
- Parameters
y_score (np.ndarray) – Anomaly scoring with continuous anomaly scores (same length as y_true).
- Returns
y_pred – Array of binary labels; 0 for normal points and 1 for anomalous points.
- Return type
np.ndarray
timeeval.params package¶
timeeval.params.ParameterConfig¶
- class timeeval.params.ParameterConfig¶
Base class for algorithm hyperparameter configurations.
Currently, TimeEval supports three kinds of parameter configurations:
FixedParameters: A single parameter setting with one value for each parameter.
IndependentParameterGrid and FullParameterGrid: Parameter search using the specification of a parameter grid, where each parameter can have multiple values. Depending on the parameter grid, TimeEval will build a parameter search space and test all combinations of parameters.
BayesianParameterSearch: Parameter search using Bayesian optimization.
- static defaults() ParameterConfig ¶
Returns the default parameter configuration that has only a single parameter setting with no parameters.
timeeval.params.FixedParameters¶
- class timeeval.params.FixedParameters(params: Mapping[str, Any])¶
Bases:
ParameterConfig
A single parameter setting with one value for each parameter.
Iterating over this grid yields the input setting as the first and only element.
- Parameters
params (dict of str to Any) – The parameter setting to be evaluated, as a dictionary mapping parameters to allowed values. An empty dict signifies default parameters.
Examples
>>> from timeeval.params import FixedParameters
>>> params = {"a": 2, "b": True}
>>> list(FixedParameters(params)) == (
...     [{"a": 2, "b": True}])
True
>>> FixedParameters(params)[0] == {"a": 2, "b": True}
True
timeeval.params.FullParameterGrid¶
- class timeeval.params.FullParameterGrid(param_grid: Mapping[str, Any])¶
Bases:
ParameterGridConfig
Grid of parameters with a discrete number of values for each.
Iterating over this grid yields the full cartesian product of all available parameter combinations. Uses the sklearn.model_selection.ParameterGrid internally.
- Parameters
param_grid (dict of str to sequence) – The parameter grid to explore, as a dictionary mapping parameters to sequences of allowed values. An empty dict signifies default parameters.
Examples
>>> from timeeval.params import FullParameterGrid
>>> params = {"a": [1, 2], "b": [True, False]}
>>> list(FullParameterGrid(params)) == (
...     [{"a": 1, "b": True}, {"a": 1, "b": False},
...      {"a": 2, "b": True}, {"a": 2, "b": False}])
True
>>> FullParameterGrid(params)[1] == {"a": 1, "b": False}
True
See also
sklearn.model_selection.ParameterGrid
Used internally to represent the parameter grids.
- property param_grid: ParameterGrid¶
The parameter search grid.
- Returns
param_grid – A parameter search grid compatible with sklearn:
sklearn.model_selection.ParameterGrid
- Return type
sklearn parameter grid object
timeeval.params.IndependentParameterGrid¶
- class timeeval.params.IndependentParameterGrid(param_grid: Mapping[str, Any], default_params: Optional[Mapping[str, Any]] = None)¶
Bases:
ParameterGridConfig
Grid of parameters with a discrete number of values for each.
The parameters in the dict are considered independent and explored one after the other (no cartesian product). Uses the sklearn.model_selection.ParameterGrid internally.
- Parameters
param_grid (dict of str to sequence, or sequence of such) – The parameter grid to explore, as either a dictionary mapping parameters to sequences of allowed values, or a sequence of dicts signifying a sequence of grids to search. An empty dict signifies default parameters.
default_params (dict of str to any values) – Default values for the parameters that are not in the current parameter grid.
Examples
>>> from timeeval.params import IndependentParameterGrid
>>> params = {"a": [1, 2], "b": [True, False]}
>>> default_params = {"a": 1, "b": True, "c": "auto"}
>>> list(IndependentParameterGrid(params, default_params)) == ([
...     {"a": 1, "b": True, "c": "auto"},
...     {"a": 2, "b": True, "c": "auto"},
...     {"a": 1, "b": True, "c": "auto"},
...     {"a": 1, "b": False, "c": "auto"}
... ])
True
See also
sklearn.model_selection.ParameterGrid
Used internally to represent the parameter grids.
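The one-parameter-at-a-time semantics can be sketched in plain Python. This is an illustration only (the function name and structure are ours, not TimeEval's):

```python
def independent_grid(param_grid, default_params):
    # Vary one parameter at a time; all other parameters keep
    # their values from default_params (no cartesian product).
    settings = []
    for name, values in param_grid.items():
        for value in values:
            setting = dict(default_params)
            setting[name] = value
            settings.append(setting)
    return settings

params = {"a": [1, 2], "b": [True, False]}
defaults = {"a": 1, "b": True, "c": "auto"}
print(independent_grid(params, defaults))  # four settings, as in the doctest above
```

Note that the number of settings grows additively (the sum of the value counts) rather than multiplicatively, which makes this strategy much cheaper than a full grid for many parameters.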
- property param_grid: ParameterGrid¶
The parameter search grid.
- Returns
param_grid – A parameter search grid compatible with sklearn.model_selection.ParameterGrid.
- Return type
sklearn.model_selection.ParameterGrid
timeeval.params.BayesianParameterSearch¶
- class timeeval.params.BayesianParameterSearch(config: OptunaStudyConfiguration, params: Mapping[str, BaseDistribution], include_default_params: bool = False)¶
Bases:
ParameterConfig
Performs Bayesian optimization using Optuna integration.
Note
Please install the Optuna package to use this class. If you use the recommended PostgreSQL storage backend, you also need to install the psycopg2 or psycopg2-binary package:
pip install 'optuna>=3.1.0' psycopg2
Warning
Parameter search using this class and the Optuna integration is non-deterministic. The results may vary between different runs, even if the same seed is used (e.g., for the Optuna sampler or pruner). This is because TimeEval needs to re-seed the Optuna samplers for every trial in distributed mode. This is necessary to ensure that initial random samples are different over all workers.
- Parameters
config (OptunaStudyConfiguration) – Configuration for the Optuna study. Optional parameters are filled in with the default values from the global Optuna configuration.
params (Mapping[str, BaseDistribution]) – Mapping from parameter names to the corresponding Optuna distributions, such as IntDistribution, FloatDistribution, or CategoricalDistribution.
Examples
>>> from timeeval.params import BayesianParameterSearch
>>> from timeeval.metrics import RangePrAUC
>>> from timeeval.integration.optuna import OptunaStudyConfiguration
>>> from optuna.distributions import FloatDistribution, IntDistribution
>>> config = OptunaStudyConfiguration(n_trials=10, metric=RangePrAUC())
>>> distributions = {
...     "max_features": FloatDistribution(low=0.0, high=1.0, step=0.01),
...     "window_size": IntDistribution(low=5, high=1000, step=5),
... }
>>> BayesianParameterSearch(config, distributions)
<timeeval.params.bayesian.BayesianParameterSearch object at 0x7f9cdc8faf50>
See also
- https://optuna.readthedocs.io:
Optuna documentation.
timeeval.integration.optuna.OptunaModule
Optuna integration module for TimeEval.
- iter(algorithm: Algorithm, dataset: Dataset) Iterator[Params] ¶
Iterate over the points in the grid.
- Returns
params – Yields a params object that maps each parameter to a single value.
- Return type
iterator over Params
- update_config(global_config: OptunaConfiguration) None ¶
Updates unset / default values in the study configuration with the global configuration values.
- Parameters
global_config (
OptunaConfiguration
) – Global Optuna configuration.
timeeval.utils package¶
timeeval.utils.datasets module¶
timeeval.utils.encode_params module¶
timeeval.utils.hash_dict module¶
timeeval.utils.label_formatting module¶
timeeval.utils.results_path module¶
timeeval.utils.tqdm_joblib module¶
- timeeval.utils.tqdm_joblib.tqdm_joblib(tqdm_object: tqdm) Generator[tqdm, None, None] ¶
Context manager to patch joblib to report into tqdm progress bar given as argument.
Directly taken from https://stackoverflow.com/a/58936697.
Examples
>>> import time
>>> from joblib import Parallel, delayed
>>> from tqdm import tqdm
>>>
>>> def some_method(wait_time):
...     time.sleep(wait_time)
>>>
>>> with tqdm_joblib(tqdm(desc="Sleeping method", total=10)):
...     Parallel(n_jobs=2)(delayed(some_method)(0.2) for i in range(10))
timeeval.utils.window module¶
- class timeeval.utils.window.Method(value)¶
Bases:
Enum
An enumeration.
- MEAN = 0¶
- MEDIAN = 1¶
- SUM = 2¶
- class timeeval.utils.window.ReverseWindowing(window_size: int, reduction: Method = Method.MEAN, n_jobs: int = 1, chunksize: Optional[int] = None, force_iterative: bool = False)¶
Bases:
TransformerMixin
- fit_transform(X: ndarray, y=None, **fit_params) ndarray ¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
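ReverseWindowing maps window-based anomaly scores (one score per sliding window) back to one score per time point by aggregating, for each point, the scores of all windows that cover it. A minimal NumPy sketch of the mean-reduction case follows; this is our illustration of the idea, not the library's optimized implementation:

```python
import numpy as np

def reverse_windowing_mean(scores, window_size):
    # A series of length n yields n - window_size + 1 sliding windows,
    # so the point-wise output has len(scores) + window_size - 1 entries.
    n = len(scores) + window_size - 1
    sums = np.zeros(n)
    counts = np.zeros(n)
    for i, score in enumerate(scores):
        # Window i covers the points i .. i + window_size - 1.
        sums[i:i + window_size] += score
        counts[i:i + window_size] += 1
    # Each point receives the mean over all windows containing it.
    return sums / counts

print(reverse_windowing_mean(np.array([1.0, 2.0, 3.0]), window_size=2))
# [1.  1.5 2.5 3. ]
```

The MEDIAN and SUM reductions from the Method enum above replace the mean aggregation with a median or sum over the covering windows.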
timeeval._core package¶
Warning
This package contains TimeEval-internal classes and functions. Do not use or change!
timeeval.integration package¶
Available integration modules:
Optuna integration¶
Optuna is an automatic hyperparameter optimization framework, and this integration allows you to use it within TimeEval. TimeEval loads the OptunaModule automatically if at least one algorithm uses BayesianParameterSearch as its parameter search strategy. Please make sure that you install the required dependencies for Optuna before using this integration (we also recommend installing psycopg2 to use the PostgreSQL storage backend):
pip install 'optuna>=3.1.0' psycopg2
The following Optuna features are supported:
Definition of search spaces using Optuna distributions for each algorithm (one study per algorithm): BayesianParameterSearch.
Configurable samplers.
Configurable storage backends (in-memory, RDB, Journal, etc.).
Resuming of existing studies (via RDB storage backend).
Parallel and distributed parameter search of a single or multiple studies (synchronized via RDB storage backend).
Warning
Parameter search using the Optuna integration is non-deterministic. The results may vary between different runs, even if the same seed is used (e.g., for the Optuna sampler or pruner). This is because TimeEval needs to re-seed the Optuna samplers for every trial in distributed mode. This is necessary to ensure that initial random samples are different over all workers.
TimeEval automatically manages an RDB storage backend if you use the default configuration. This allows you to start TimeEval in distributed mode and perform the parameter search in parallel across all workers.
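For explicit control over the integration, the module can be configured up front. The following sketch is based on the OptunaConfiguration constructor documented in this reference and is untested here; it assumes a local TimeEval installation with the Optuna dependencies:

```python
from timeeval.integration.optuna import OptunaConfiguration

# Let TimeEval manage a PostgreSQL container as the storage backend,
# start the Optuna dashboard for monitoring, and clean up the
# managed containers when TimeEval is finished:
config = OptunaConfiguration(
    default_storage="postgresql",
    dashboard=True,
    remove_managed_containers=True,
)
```

In non-distributed runs, "journal-file" can be used instead of "postgresql" to keep the study results in a local file without requiring Docker.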
timeeval.integration.optuna.OptunaModule¶
- class timeeval.integration.optuna.OptunaModule(config: OptunaConfiguration)¶
Bases:
TimeEvalModule
This module is automatically loaded when at least one algorithm uses
timeeval.params.BayesianParameterSearch
as parameter config.
TimeEval provides the option to use an automatically managed PostgreSQL database as the storage backend for the Optuna studies. The database is started as an additional Docker container, either on the local machine or on the scheduler node in distributed execution mode, and is automatically stopped when TimeEval is finished. The database storage backend allows you to monitor the studies using the Optuna dashboard (which can also be started automatically in another Docker container) and enables distributed execution of the studies. This is the default behavior if no storage backend is specified in the configuration.
- Parameters
config (
OptunaConfiguration
) – The configuration for the Optuna module.
- finalize(timeeval: TimeEval) None ¶
Called during the FINALIZE-phase of TimeEval and after the individual algorithms’ finalize-functions were executed.
- Parameters
timeeval (
TimeEval
) – The TimeEval instance that is currently running.
- load_studies() List[StudySummary] ¶
Load all studies from the default storage. This does not include studies that were stored in a different storage backend (i.e., where the storage backend was changed using the timeeval.integration.optuna.OptunaStudyConfiguration).
- Returns
study_summaries – A list of study summaries.
- Return type
List[StudySummary]
See also
optuna.study.get_all_study_summaries()
Optuna function which is used to load the studies.
optuna.study.StudySummary
Optuna class which is used to represent the study summaries.
timeeval.integration.optuna.OptunaConfiguration¶
- class timeeval.integration.optuna.OptunaConfiguration(default_storage: Union[str, Callable[[], BaseStorage]], default_sampler: Optional[BaseSampler] = None, default_pruner: Optional[BasePruner] = None, continue_existing_studies: bool = False, dashboard: bool = False, remove_managed_containers: bool = False, use_default_logging: bool = False, log_level: Union[int, str] = 20)¶
Bases:
object
Configuration options for the Optuna module. This includes default options for all Optuna studies.
- Parameters
default_storage (str or lambda returning instance of optuna.storages.BaseStorage) – Storage to store and synchronize the results of the studies. Per default, TimeEval will use a journal file in local execution mode and a PostgreSQL database in distributed execution mode. The database is automatically started and stopped by TimeEval using the latest postgres Docker image. Use "postgresql" to let TimeEval handle starting and stopping a PostgreSQL database using Docker. Use "journal-file" to let TimeEval create a local file as the storage backend; this only works in non-distributed mode.
default_sampler (optuna.samplers.BaseSampler, optional) – Sampler to use for the study. If not provided, the default sampler is used.
default_pruner (optuna.pruners.BasePruner, optional) – Pruner to use for the study. If not provided, the default pruner is used.
continue_existing_studies (bool, optional) – If True, continue a study with the given name if it already exists in the storage backend. If False, raise an error if a study with the same name already exists.
dashboard (bool, optional) – If True, start the Optuna dashboard (within its own Docker container) to monitor the studies. In distributed execution mode, the dashboard is started on the scheduler node.
remove_managed_containers (bool, optional) – If True, remove the containers managed by TimeEval (e.g., the PostgreSQL database) when TimeEval is finished.
use_default_logging (bool, optional) – If True, use the default logging configuration of the Optuna library, which logs the progress of the studies to stderr. If False, use the logging configuration of TimeEval and propagate the Optuna log messages.
log_level (int or str, optional) – The log level to use for the Optuna logger. The default is info = logging.INFO = 20.
See also
optuna.create_study()
Used to create the Optuna study object; includes detailed explanation of the parameters.
timeeval.integration.optuna.OptunaModule
Optuna integration module for TimeEval.
- static default(distributed: bool) OptunaConfiguration ¶
- default_pruner: Optional[BasePruner] = None¶
- default_sampler: Optional[BaseSampler] = None¶
timeeval.integration.optuna.OptunaStudyConfiguration¶
- class timeeval.integration.optuna.OptunaStudyConfiguration(n_trials: int, metric: Metric, storage: Optional[Union[str, Callable[[], BaseStorage]]] = None, sampler: Optional[BaseSampler] = None, pruner: Optional[BasePruner] = None, direction: Optional[Union[str, StudyDirection]] = 'maximize', continue_existing_study: bool = False)¶
Bases:
object
Configuration for BayesianParameterSearch.
The parameters n_trials and metric are required. All other parameters are optional and will be filled with the default values from the global Optuna configuration if not provided.
- Parameters
n_trials (int) – Number of trials to perform.
metric (Metric) – TimeEval metric to use as the study's objective function.
storage (str or lambda returning instance of optuna.storages.BaseStorage, optional) – Storage to store the results of the study.
sampler (optuna.samplers.BaseSampler, optional) – Sampler to use for the study. If not provided, the default sampler is used.
pruner (optuna.pruners.BasePruner, optional) – Pruner to use for the study. If not provided, the default pruner is used.
direction (str or optuna.study.StudyDirection, optional) – Direction of the optimization (minimize or maximize). If None, the Optuna default direction is used.
continue_existing_study (bool, optional) – If True, continue a study with the given name if it already exists in the storage backend. If False, raise an error if a study with the same name already exists.
See also
optuna.create_study()
Used to create the Optuna study object; includes detailed explanation of the parameters.
timeeval.integration.optuna.OptunaModule
Optuna integration module for TimeEval.
- copy(n_trials: Optional[int] = None, metric: Optional[Metric] = None, storage: Optional[Union[str, Callable[[], BaseStorage]]] = None, sampler: Optional[BaseSampler] = None, pruner: Optional[BasePruner] = None, direction: Optional[Union[str, StudyDirection]] = None, continue_existing_study: Optional[bool] = None) OptunaStudyConfiguration ¶
Create a copy of this configuration with the given parameters replaced.
- direction: Optional[Union[str, StudyDirection]] = 'maximize'¶
- pruner: Optional[BasePruner] = None¶
- sampler: Optional[BaseSampler] = None¶
- update_unset_options(global_config: OptunaConfiguration) OptunaStudyConfiguration ¶
Base class for TimeEval modules¶
- class timeeval.integration.TimeEvalModule¶
Bases:
ABC
Base class for TimeEval modules that add additional functionality to TimeEval.
Inheriting classes can implement any of the following lifecycle-hooks:
These methods are called at the corresponding time in the TimeEval run loop. Modules can assume that the TimeEval configuration is already loaded and checked for user errors. If TimeEval is executed in distributed mode, timeeval.TimeEval.distributed is set to True and remoting is already set up before the first call to prepare().
Note
Implementing a TimeEval module is an advanced usage scenario and requires a good understanding of the internals of TimeEval.
- finalize(timeeval: TimeEval) None ¶
Called during the FINALIZE-phase of TimeEval and after the individual algorithms’ finalize-functions were executed.
- Parameters
timeeval (
TimeEval
) – The TimeEval instance that is currently running.
- post_run(timeeval: TimeEval) None ¶
Called after the EVALUATION-phase of TimeEval.
- Parameters
timeeval (
TimeEval
) – The TimeEval instance that is currently running.
Contributor’s Guide¶
Installation from source¶
tl;dr
git clone git@github.com:timeeval/timeeval.git
cd timeeval/
conda create -n timeeval python=3.7
conda activate timeeval
pip install -r requirements.txt
python setup.py install
Prerequisites¶
The following tools are required to install TimeEval from source:
git
Python 3.7 or later and pip (Anaconda or Miniconda is preferred)
Steps¶
Clone this repository using git and change into its root directory.
Create a conda environment and install all required dependencies:
conda create -n timeeval python=3.7
conda activate timeeval
pip install -r requirements.txt
Build TimeEval: python setup.py bdist_wheel. This creates a Python wheel in the dist/ folder.
Install TimeEval and all of its dependencies: pip install dist/TimeEval-*-py3-none-any.whl.
If you want to make changes to TimeEval or run the tests, you need to install the development dependencies from requirements.dev: pip install -r requirements.dev.
Tests¶
Run the tests in ./tests/ as follows:
python setup.py test
or
pytest tests
If you want to run the tests that include docker and dask, you need to fulfill some prerequisites:
Docker is installed and running.
Your SSH server is running, and you can SSH to localhost with your user without supplying a password.
You have installed all TimeEval dev dependencies.
You can then run:
pytest tests --docker --dask
Default Tests¶
By default, tests that are marked with the following keys are skipped:
docker
dask
To run these tests, add the respective keys as parameters:
pytest --[key] # e.g. --docker
Overview¶
TimeEval is an evaluation tool for time series anomaly detection algorithms. It defines common interfaces for datasets and algorithms to allow the efficient comparison of the algorithms’ quality and runtime performance. TimeEval can be configured using a simple Python API and comes with a large collection of compatible datasets and algorithms.
TimeEval takes your input and automatically creates experiment configurations by taking the cross-product of your inputs. It executes all experiment configurations one after the other or — when distributed — in parallel and records the anomaly detection quality and the runtime of the algorithms.
TimeEval takes four inputs for the experiment creation:
Algorithms
Datasets
Algorithm hyperparameter specifications
A repetition number
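The cross-product construction can be illustrated in plain Python. All names and values below are hypothetical placeholders, not TimeEval internals:

```python
from itertools import product

# Illustrative inputs (hypothetical algorithm and dataset names):
algorithms = ["cof", "baseline"]
datasets = ["ecg-1", "sensor-7"]
param_settings = {
    "cof": [{"n_neighbors": 10}, {"n_neighbors": 20}],
    "baseline": [{}],
}
repetitions = 2

# One experiment per (algorithm, dataset, parameter setting, repetition):
experiments = [
    (algo, ds, params, rep)
    for algo, ds in product(algorithms, datasets)
    for params in param_settings[algo]
    for rep in range(repetitions)
]
print(len(experiments))  # 12 experiment configurations
```

The experiment count grows multiplicatively with each input, which is why distributing the evaluation across a cluster becomes attractive for larger studies.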
The following code snippet shows a simple example experiment evaluating COF and a simple baseline algorithm on some test data:
1#!/usr/bin/env python3
2from pathlib import Path
3from typing import Any, Dict
4import numpy as np
5
6from timeeval import TimeEval, DatasetManager, DefaultMetrics, Algorithm, TrainingType, InputDimensionality
7from timeeval.adapters import FunctionAdapter # for defining customized algorithm
8from timeeval.algorithms import cof
9from timeeval.data_types import AlgorithmParameter
10from timeeval.params import FixedParameters
11
12def your_algorithm_function(data: AlgorithmParameter, args: Dict[str, Any]) -> np.ndarray:
13 if isinstance(data, np.ndarray):
14 return np.zeros_like(data)
15 else: # isinstance(data, pathlib.Path)
16 return np.genfromtxt(data, delimiter=",", skip_header=1)[:, 1]
17
18def main():
19 dm = DatasetManager(Path("tests/example_data"), create_if_missing=False)
20 datasets = dm.select()
21 algorithms = [
22 # list of algorithms which will be executed on the selected dataset(s)
23 cof(
24 params=FixedParameters({
25 "n_neighbors": 20,
26 "random_state": 42})
27 ),
28 # calling customized algorithm
29 Algorithm(
30 name="MyPythonFunctionAlgorithm",
31 main=FunctionAdapter(your_algorithm_function),
32 data_as_file=False)
33 ]
34
35 timeeval = TimeEval(dm, datasets, algorithms, metrics=[DefaultMetrics.ROC_AUC, DefaultMetrics.RANGE_PR_AUC])
36 timeeval.run()
37 results = timeeval.get_results(aggregated=False)
38 print(results)
39
40
41if __name__ == "__main__":
42 main()
Features¶
Listing Example usage of TimeEval illustrates some main features of TimeEval:
- Dataset API:
Interface to the available dataset collections to select datasets easily (L19-20).
- Algorithm Adapter Architecture:
TimeEval supports different algorithm adapters to execute simple Python functions or whole pipelines and applications (L23, L31).
- Hyperparameter Specification:
Algorithm hyperparameters can be specified using different search grids (L24-26).
- Metrics:
TimeEval provides various evaluation metrics (such as timeeval.utils.metrics.DefaultMetrics.ROC_AUC, timeeval.utils.metrics.DefaultMetrics.RANGE_PR_AUC, or timeeval.utils.metrics.FScoreAtK) and measures algorithm runtimes automatically (L35).
- Distributed Execution:
TimeEval can be deployed on a compute cluster to execute evaluation tasks in a distributed manner.
Installation¶
Prerequisites:
TimeEval is published to PyPI and you can install it using:
pip install TimeEval
Attention
Currently, TimeEval is tested only on Linux systems and relies on Unix-specific capabilities.
License¶
The project is licensed under the MIT license.
If you use TimeEval in your project or research, please cite our demonstration paper:
Phillip Wenig, Sebastian Schmidl, and Thorsten Papenbrock. TimeEval: A Benchmarking Toolkit for Time Series Anomaly Detection Algorithms. PVLDB, 15(12): 3678 - 3681, 2022. doi:10.14778/3554821.3554873
You can use the following BibTeX entry:
@article{WenigEtAl2022TimeEval,
title = {TimeEval: {{A}} Benchmarking Toolkit for Time Series Anomaly Detection Algorithms},
author = {Wenig, Phillip and Schmidl, Sebastian and Papenbrock, Thorsten},
date = {2022},
journaltitle = {Proceedings of the {{VLDB Endowment}} ({{PVLDB}})},
volume = {15},
number = {12},
pages = {3678--3681},
doi = {10.14778/3554821.3554873}
}
User Guide¶
New to TimeEval? Check out our User Guides to get started with TimeEval. The user guides explain TimeEval's API and how to use it to achieve your goals.
TimeEval Concepts¶
Background information and in-depth explanations about how TimeEval works can be found in the TimeEval concepts reference.
API Reference¶
The API reference guide contains a detailed description of the functions, modules, and objects included in TimeEval. The API reference describes how the methods work and which parameters can be used.
Contributor’s Guide¶
Want to add to the codebase or help with the documentation? The contributing guidelines will guide you through the process of improving TimeEval and its ecosystem.
Additional Links¶
Datasets for TimeEval
TimeEval GUI (prototype)
Time series anomaly dataset generator GutenTAG