timeeval.datasets package¶

timeeval.datasets.analyzer¶

class timeeval.datasets.analyzer.DatasetAnalyzer(dataset_id: Tuple[str, str], is_train: bool, df: Optional[DataFrame] = None, dataset_path: Optional[Path] = None, dmgr: Optional[Datasets] = None, ignore_stationarity: bool = False, ignore_trend: bool = False)¶

Utility class to analyze a dataset and infer metadata about the dataset.

Use this class to compute necessary metadata from a time series. The computation is started directly when instantiating this class. You can access the results using the property metadata. There are multiple ways to instantiate this class, but you always have to specify the dataset ID, because it is part of the metadata:

Use an existing pandas data frame object. Supply a value to the parameter df.

Use a path to a time series. Supply a value to the parameter dataset_path.

Use a dataset ID and a reference to the dataset manager. Supply a value to the parameter dmgr.

This class computes simple metadata, such as the number of anomalies, mean, and standard deviation, as well as advanced metadata, such as trend and stationarity information, for all time series channels. The simple metadata is exact, whereas the advanced metadata is estimated from the observed time series data. The trend is computed by fitting linear regression models of different order to the time series. If a regression correlates strongly enough with the observed values, the trend and its confidence are recorded. The stationarity of the time series is estimated using two statistical tests: the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test and the Augmented Dickey-Fuller (ADF) test.
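The trend estimation described above can be illustrated with a minimal, standard-library-only sketch for the first-order (linear) case. This is not TimeEval's implementation: the function name `linear_trend`, the parameter `min_r2`, and the threshold of 0.8 are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def linear_trend(values: List[float], min_r2: float = 0.8) -> Optional[Tuple[float, float]]:
    """Fit y = a + b*t by least squares and return (slope, r2) if the fit is good enough.

    Hypothetical helper: `min_r2` is an assumed threshold, not a TimeEval setting.
    """
    n = len(values)
    if n < 2:
        return None
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    # Coefficient of determination: share of the variance explained by the fit.
    ss_res = sum((y - (intercept + slope * t)) ** 2 for t, y in enumerate(values))
    ss_tot = sum((y - y_mean) ** 2 for y in values)
    if ss_tot == 0:
        return None  # constant series: there is no trend to report
    r2 = 1 - ss_res / ss_tot
    return (slope, r2) if r2 >= min_r2 else None


rising = [0.5 * t + 1.0 for t in range(50)]   # clear linear trend
alternating = [(-1) ** t for t in range(50)]  # oscillation without a trend
print(linear_trend(rising))
print(linear_trend(alternating))
```

The first series yields a slope and a confidence value, while the second one is rejected because the fitted line explains almost none of its variance. Higher-order trends would follow the same pattern with a polynomial fit.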

The metadata of a dataset can be stored to disk. This class provides utility functions to create a JSON-file per dataset, containing the metadata about the test time series and the optional training time series.

Parameters

dataset_id (tuple of str, str) – ID of the dataset consisting of collection and dataset name.
is_train (bool) – Whether the analyzed time series is the training (True) or the testing (False) time series of the dataset.
df (data frame, optional) – Time series data frame. If df is supplied, you can omit dataset_path and dmgr.
dataset_path (path, optional) – Path to the time series. If dataset_path is supplied, you can omit df and dmgr.
dmgr (Datasets object, optional) – Dataset manager instance that is used to load the time series if df and dataset_path are not specified.
ignore_stationarity (bool, optional) – Skip estimating the stationarity of the time series’ channels. This might be necessary for large datasets, because this step takes a lot of time.
ignore_trend (bool, optional) – Skip estimating the trend type of the time series’ channels. This might be necessary for large datasets, because this step takes a lot of time.

static load_from_json(filename: Union[str, Path], train: bool = False) → DatasetMetadata¶

Loads existing time series metadata from disk.

If there are multiple metadata entries with the same dataset ID and training/testing-label, the first entry is used.

Parameters

filename (path) – Path to the JSON-file containing the dataset metadata. Can be written using timeeval.datasets.analyzer.DatasetAnalyzer.save_to_json().
train (bool) – Whether the training or testing time series’ metadata should be loaded from the file.

Returns

metadata – Metadata of the training or testing time series.

Return type

time series metadata object

property metadata: DatasetMetadata¶: Returns the computed metadata about the time series.

save_to_json(filename: Union[str, Path], overwrite: bool = False) → None¶

Save the computed metadata for a dataset to disk.

This method writes a dataset’s metadata to a JSON-formatted file on disk. The file contains a list of metadata specifications: one for the test time series and potentially another one for the training time series. Since the DatasetAnalyzer analyzes only a single time series at a time, this method appends the current metadata to the existing list by default. If you want to overwrite the existing content of the file, set the parameter overwrite to True.

Parameters

filename (path) – Path to the file that the metadata should be written to. The file may already exist.
overwrite (bool) – Whether existing data in the file should be overwritten (True) or the current metadata should be appended to it (False, default).
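The append-by-default behavior can be sketched with the standard library alone. The helper `write_metadata`, the file name, and the entry keys used here are hypothetical and only mirror the semantics described above; the real method serializes a full DatasetMetadata object.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_metadata(filename: Path, entry: Dict[str, Any], overwrite: bool = False) -> None:
    """Append `entry` to the JSON list in `filename`, or replace the list entirely."""
    entries: List[Dict[str, Any]] = []
    if not overwrite and filename.exists():
        entries = json.loads(filename.read_text())
    entries.append(entry)
    filename.write_text(json.dumps(entries, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset1.metadata.json"  # illustrative file name
    # Two calls without `overwrite` leave two entries in the file.
    write_metadata(path, {"dataset_id": ["custom", "dataset1"], "is_train": False})
    write_metadata(path, {"dataset_id": ["custom", "dataset1"], "is_train": True})
    appended = json.loads(path.read_text())
    # With `overwrite=True` the previous content is discarded.
    write_metadata(path, {"dataset_id": ["custom", "dataset1"], "is_train": False}, overwrite=True)
    replaced = json.loads(path.read_text())

print(len(appended), len(replaced))
```

Analyzing the test and the training time series one after the other, and saving after each run, therefore produces a single file with both specifications.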

timeeval.datasets.custom¶

class timeeval.datasets.custom.CDEntry(test_path, train_path, details)¶

Bases: NamedTuple

details: Dataset¶: Alias for field number 2

test_path: Path¶: Alias for field number 0

train_path: Optional[Path]¶: Alias for field number 1
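Because CDEntry is a NamedTuple, its fields can be read by name or by position, which is what the "Alias for field number" notes refer to. The sketch below uses a hypothetical stand-in class `Entry` with the same field order; the `details` field holds a plain string here instead of a Dataset object to keep the example self-contained.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Entry(NamedTuple):  # hypothetical stand-in for CDEntry
    test_path: Path
    train_path: Optional[Path]
    details: str  # CDEntry stores a Dataset object in this position


entry = Entry(Path("test.csv"), None, "placeholder")

# Tuple unpacking follows the declared field order.
test_path, train_path, details = entry
print(entry.test_path, entry[0])
```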

class timeeval.datasets.custom.CustomDatasets(dataset_config: Union[str, Path])¶

Bases: CustomDatasetsBase

Implementation of the custom datasets API.

Internal API! You should not need to use or modify this class.

This class behaves similarly to the timeeval.datasets.datasets.Datasets API but uses a different internal representation for the dataset index.

get(dataset_name: str) → Dataset¶

get_collection_names() → List[str]¶

get_dataset_names() → List[str]¶

get_path(dataset_name: str, train: bool) → Path¶

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) → List[Tuple[str, str]]¶

timeeval.datasets.custom_base¶

class timeeval.datasets.custom_base.CustomDatasetsBase¶

Bases: ABC

API definition for custom datasets.

Internal API! You should not need to use or modify this class.

abstract get(dataset_name: str) → Dataset¶

abstract get_collection_names() → List[str]¶

abstract get_dataset_names() → List[str]¶

abstract get_path(dataset_name: str, train: bool) → Path¶

abstract select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) → List[Tuple[str, str]]¶

timeeval.datasets.custom_noop¶

class timeeval.datasets.custom_noop.NoOpCustomDatasets¶

Bases: CustomDatasetsBase

Dummy implementation of the CustomDatasets interface.

Internal API! You should not need to use or modify this class.

This dummy implementation does nothing and improves readability of the timeeval.datasets.datasets.Datasets-implementation by removing the need for None-checks.

get(dataset_name: str) → Dataset¶

get_collection_names() → List[str]¶

get_dataset_names() → List[str]¶

get_path(dataset_name: str, train: bool) → Path¶

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) → List[Tuple[str, str]]¶
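The design idea behind such a dummy implementation is the null object pattern. The following sketch shows it with a deliberately reduced, hypothetical interface (`DatasetSource`, `FileDatasetSource`, `NoOpDatasetSource`, and `Index` are illustrative names, and the assumption that the no-op object returns empty results is not confirmed by this reference).

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class DatasetSource(ABC):  # hypothetical, simplified interface
    @abstractmethod
    def get_dataset_names(self) -> List[str]:
        ...


class FileDatasetSource(DatasetSource):
    def __init__(self, names: List[str]) -> None:
        self._names = names

    def get_dataset_names(self) -> List[str]:
        return list(self._names)


class NoOpDatasetSource(DatasetSource):
    def get_dataset_names(self) -> List[str]:
        return []  # nothing configured, so nothing to report


class Index:
    def __init__(self, benchmark: List[str], custom: Optional[DatasetSource] = None) -> None:
        self._benchmark = benchmark
        # Substitute the no-op object once, so later code never checks for None.
        self._custom: DatasetSource = custom if custom is not None else NoOpDatasetSource()

    def get_dataset_names(self) -> List[str]:
        return self._benchmark + self._custom.get_dataset_names()


print(Index(["a"]).get_dataset_names())
print(Index(["a"], FileDatasetSource(["b"])).get_dataset_names())
```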

timeeval.datasets.dataset¶

class timeeval.datasets.dataset.Dataset(datasetId: Tuple[str, str], dataset_type: str, training_type: TrainingType, length: int, dimensions: int, contamination: float, min_anomaly_length: int, median_anomaly_length: int, max_anomaly_length: int, period_size: Optional[int] = None, num_anomalies: Optional[int] = None)¶

Bases: object

Dataset information containing basic metadata about the dataset.

This class is used within TimeEval heuristics to determine the heuristic values based on the dataset properties.

property collection_name: str¶

contamination: float¶

datasetId: Tuple[str, str]¶

dataset_type: str¶

dimensions: int¶

property has_anomalies: Optional[bool]¶

property input_dimensionality: InputDimensionality¶

length: int¶

max_anomaly_length: int¶

median_anomaly_length: int¶

min_anomaly_length: int¶

property name: str¶

num_anomalies: Optional[int] = None¶

period_size: Optional[int] = None¶

training_type: TrainingType¶

timeeval.datasets.dataset_manager¶

class timeeval.datasets.dataset_manager.DatasetManager(data_folder: Union[str, Path], custom_datasets_file: Optional[Union[str, Path]] = None, create_if_missing: bool = True)¶

Bases: ContextManager[DatasetManager], Datasets

Manages benchmark datasets and their meta-information.

Manages dataset collections and their meta-information that are stored in a single folder with an index file. You can also use this class to create a new TimeEval dataset collection.

Warning

ATTENTION: Not multi-processing-safe! There is no check for changes to the underlying datasets.csv file while this class is loaded.

Read-only access is fine with multiple processes.

Parameters

data_folder (path) – Path to the folder, where the benchmark data is stored. This folder consists of the file datasets.csv and the datasets in a hierarchical storage layout.
custom_datasets_file (path) – Path to a file listing additional custom datasets.
create_if_missing (bool) – Create an index-file in the data_folder if none could be found. Set this to False if an exception should be raised if the folder is wrong or does not exist.

Raises

FileNotFoundError – If create_if_missing is set to False and no datasets.csv-file was found in the data_folder.

INDEX_FILENAME: str = 'datasets.csv'¶

METADATA_FILENAME_SUFFIX: str = 'metadata.json'¶

add_dataset(dataset: DatasetRecord) → None¶

Adds a new dataset to the benchmark dataset collection (in-memory).

The provided dataset metadata is added to this dataset collection (to the in-memory index). You can save the in-memory index to disk using the timeeval.datasets.DatasetManager.save()-method. The referenced time series files (training and testing paths) are not touched. If a dataset with the same dataset ID (collection_name, dataset_name) already exists, its entries are overwritten!

Parameters: dataset (DatasetRecord object) – The dataset information to add to the benchmark collection.
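The overwrite-on-duplicate behavior follows naturally from an index keyed by dataset ID. The sketch below models that index with a plain dictionary; the module-level `index` and this simplified `add_dataset` function are illustrative assumptions, not the DatasetManager internals.

```python
from typing import Dict, Tuple

DatasetId = Tuple[str, str]

# Hypothetical in-memory index: dataset ID -> metadata fields.
index: Dict[DatasetId, Dict[str, object]] = {}


def add_dataset(collection_name: str, dataset_name: str, **metadata: object) -> None:
    """Store the metadata under its dataset ID; an existing entry is replaced."""
    index[(collection_name, dataset_name)] = dict(metadata)


add_dataset("my-collection", "series-a", length=1000, test_path="a.test.csv")
# Same ID again: the first entry is silently replaced, not duplicated.
add_dataset("my-collection", "series-a", length=2000, test_path="a-v2.test.csv")

print(len(index), index[("my-collection", "series-a")]["length"])
```

Nothing is written to disk in this sketch, which matches the note above that changes stay in memory until they are saved explicitly.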

add_datasets(datasets: List[DatasetRecord]) → None¶

Add a list of datasets to the dataset collection.

Add a list of new datasets to the benchmark dataset collection (in-memory). Already existing keys are overwritten!

Parameters: datasets (list of DatasetRecord objects) – List of dataset metadata to add to this dataset collection.

df() → DataFrame¶

Returns a copy of the internal dataset metadata collection.

The DataFrame has the following schema:

Index: dataset_name, collection_name
Columns: train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size

Returns: df – All custom and benchmark datasets and their metadata.
Return type: data frame

get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) → Dataset¶

Returns dataset metadata.

Examples

>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")

Access using the dataset ID:

>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)

Access using collection and dataset name:

>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)

Parameters

collection_name (str or tuple of str and str) – Name of the dataset collection or the dataset ID (collection, and dataset name).
dataset_name (str, optional) – Name of the dataset or empty.

Returns

dataset – The dataset metadata

Return type

a dataset object
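Accepting either a dataset ID tuple or two separate names usually comes down to normalizing the arguments first. The helper `resolve_dataset_id` below is a hypothetical sketch of that step; in particular, raising a ValueError when the dataset name is missing is an assumption and may differ from what get() actually does.

```python
from typing import Optional, Tuple, Union


def resolve_dataset_id(
    collection_name: Union[str, Tuple[str, str]],
    dataset_name: Optional[str] = None,
) -> Tuple[str, str]:
    """Normalize both call styles to a (collection, dataset) tuple."""
    if isinstance(collection_name, tuple):
        return collection_name  # already a full dataset ID
    if dataset_name is None:
        # Assumed error handling, for illustration only.
        raise ValueError("dataset_name is required when collection_name is a string")
    return (collection_name, dataset_name)


print(resolve_dataset_id(("custom", "dataset1")))
print(resolve_dataset_id("custom", "dataset1"))
```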

get_collection_names() → List[str]¶: Returns the unique dataset collection names (includes custom datasets if present).

get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) → DataFrame¶

Loads the training/testing time series as a data frame.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

df – The training or testing time series as a pandas.DataFrame.

Return type

data frame

get_dataset_names() → List[str]¶: Returns the unique dataset names (includes custom datasets if present).

get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) → ndarray¶

Loads the training/testing time series as a multi-dimensional array.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

values – The training or testing time series as a multi-dimensional array.

Return type

ndarray

get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) → Path¶

Returns the path to the training/testing time series of the dataset.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
train (bool) – Whether the path to the training (True) or testing (False, default) time series should be returned.

Returns

dataset_path – The path to the training or testing time series.

Return type

path

get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) → DatasetMetadata¶

Computes detailed metadata about the training or testing time series of a dataset.

For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:

Information about the training time series, if train=True is specified.
Mean, variance, trend, and stationarity information for each channel of the time series individually.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be analyzed.

Returns

metadata – Detailed metadata about the training or testing time series.

Return type

dataset metadata object

See also

timeeval.datasets.DatasetAnalyzer: Utility class used for the extraction of metadata.
timeeval.datasets.DatasetMetadata: Data class of the returned result.

get_training_type(dataset_id: Tuple[str, str]) → TrainingType¶

Returns the training type of a specific dataset.

Parameters: dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
Returns: training_type – Either unsupervised, semi-supervised, or supervised.
Return type: TrainingType enum

See also

timeeval.TrainingType: Enumeration of training types that could be returned by this method.

load_custom_datasets(file_path: Union[str, Path]) → None¶

Reads a configuration file that contains additional datasets and adds them to the current dataset index.

You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file is a JSON file with the following structure:

{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10
    }
}

The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.

Warning

Repeated calls to this method overwrite the existing custom dataset list.

Parameters: file_path (path) – Path to the custom dataset configuration file.
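A configuration file with this structure can be generated programmatically. The sketch below writes one with a full and a minimal entry and runs a small pre-flight check; the helper `check_config`, the file name, and the second dataset entry are assumptions for illustration, and the check is not part of TimeEval.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple

config: Dict[str, Dict[str, Any]] = {
    # Full entry: all optional properties are set.
    "dataset1": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10,
    },
    # Minimal entry: only the test time series is given.
    "dataset2": {"test_path": "./path/to/other-test.csv"},
}


def check_config(cfg: Dict[str, Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Hypothetical pre-flight check; returns the dataset IDs the entries would get."""
    for name, entry in cfg.items():
        if "test_path" not in entry:
            raise ValueError(f"dataset '{name}' is missing the test_path property")
    # Custom datasets are assigned to the "custom" collection.
    return [("custom", name) for name in cfg]


with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "custom-datasets.json"  # illustrative file name
    config_file.write_text(json.dumps(config, indent=4))
    dataset_ids = check_config(json.loads(config_file.read_text()))

print(dataset_ids)
```

A file written this way could then be passed to load_custom_datasets() or to the custom_datasets_file constructor argument.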

refresh(force: bool = False) → None¶: Re-read the benchmark dataset collection information from disk.

save() → None¶

Saves the in-memory dataset index to disk.

Persists newly added benchmark datasets from memory to the benchmark dataset collection file datasets.csv. Custom datasets are excluded from persistence and cannot be saved to disk; use add_dataset() or add_datasets() to add datasets to the benchmark dataset collection.

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) → List[Tuple[str, str]]¶

Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.

Parameters

collection (str) – restrict datasets to a specific collection
dataset (str) – restrict datasets to a specific name
dataset_type (str) – restrict dataset type (e.g. real or synthetic)
datetime_index (bool) – only select datasets for which a datetime index exists; if True: “timestamp”-column has datetime values; if False: “timestamp”-column has monotonically increasing integer values; this condition is ignored by custom datasets.
training_type (timeeval.TrainingType) – select datasets for specific training needs: * SUPERVISED, * SEMI_SUPERVISED, or * UNSUPERVISED
train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies
input_dimensionality (timeeval.InputDimensionality) – restrict dataset to input type, either univariate or multivariate
min_anomalies (int) – restrict datasets to those with a minimum number of min_anomalies anomalous subsequences
max_anomalies (int) – restrict datasets to those with a maximum number of max_anomalies anomalous subsequences
max_contamination (float) – restrict datasets to those having a contamination less than or equal to max_contamination

Returns

dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).

Return type

List[Tuple[str,str]]
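The selection semantics (all given conditions must hold, and a condition left at None imposes no restriction) can be sketched over a small in-memory table. The `records` data and this reduced `select` function with only four of the parameters are hypothetical and do not reflect how the dataset manager stores its index.

```python
from typing import Any, Dict, List, Optional, Tuple

DatasetId = Tuple[str, str]

# Hypothetical metadata table, keyed by dataset ID.
records: Dict[DatasetId, Dict[str, Any]] = {
    ("collection-a", "series-1"): {"dataset_type": "real", "num_anomalies": 3, "contamination": 0.02},
    ("collection-a", "series-2"): {"dataset_type": "synthetic", "num_anomalies": 1, "contamination": 0.10},
    ("collection-b", "series-3"): {"dataset_type": "real", "num_anomalies": 5, "contamination": 0.01},
}


def select(
    collection: Optional[str] = None,
    dataset_type: Optional[str] = None,
    min_anomalies: Optional[int] = None,
    max_contamination: Optional[float] = None,
) -> List[DatasetId]:
    """Return the IDs of all datasets that satisfy every condition that is not None."""
    selected: List[DatasetId] = []
    for (coll, name), meta in records.items():
        if collection is not None and coll != collection:
            continue
        if dataset_type is not None and meta["dataset_type"] != dataset_type:
            continue
        if min_anomalies is not None and meta["num_anomalies"] < min_anomalies:
            continue
        if max_contamination is not None and meta["contamination"] > max_contamination:
            continue
        selected.append((coll, name))
    return selected


print(select())
print(select(dataset_type="real", min_anomalies=4))
print(select(collection="collection-a", max_contamination=0.05))
```

Calling the function without arguments returns every dataset ID, and each additional argument can only narrow the result.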

class timeeval.datasets.dataset_manager.DatasetRecord(collection_name, dataset_name, train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size)¶

Bases: NamedTuple

collection_name: str¶: Alias for field number 0

contamination: float¶: Alias for field number 12

count(value, /)¶: Return number of occurrences of value.

dataset_name: str¶: Alias for field number 1

dataset_type: str¶: Alias for field number 4

datetime_index: bool¶: Alias for field number 5

dimensions: int¶: Alias for field number 11

index(value, start=0, stop=9223372036854775807, /)¶

Return first index of value.

Raises ValueError if the value is not present.

input_type: str¶: Alias for field number 9

length: int¶: Alias for field number 10

max_anomaly_length: int¶: Alias for field number 16

mean: float¶: Alias for field number 17

median_anomaly_length: int¶: Alias for field number 15

min_anomaly_length: int¶: Alias for field number 14

num_anomalies: int¶: Alias for field number 13

period_size: Optional[int]¶: Alias for field number 21

split_at: int¶: Alias for field number 6

stationarity: str¶: Alias for field number 20

stddev: float¶: Alias for field number 18

test_path: str¶: Alias for field number 3

train_is_normal: bool¶: Alias for field number 8

train_path: Optional[str]¶: Alias for field number 2

train_type: str¶: Alias for field number 7

trend: str¶: Alias for field number 19

timeeval.datasets.datasets¶

class timeeval.datasets.datasets.Datasets(df: DataFrame, custom_datasets_file: Optional[Union[str, Path]] = None)¶

Bases: ABC

Provides read-only access to benchmark datasets and their metadata.

This is an abstract class (interface). Please use timeeval.datasets.dataset_manager.DatasetManager or timeeval.datasets.multi_dataset_manager.MultiDatasetManager instead. The constructor arguments are filled in by the respective implementation.

Parameters

df (pandas DataFrame) – Metadata of all loaded datasets.
custom_datasets_file (pathlib.Path or str) – Path to a file listing additional custom datasets.

INDEX_FILENAME: str = 'datasets.csv'¶

METADATA_FILENAME_SUFFIX: str = 'metadata.json'¶

df() → DataFrame¶

Returns a copy of the internal dataset metadata collection.

The DataFrame has the following schema:

Index: dataset_name, collection_name
Columns: train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size

Returns: df – All custom and benchmark datasets and their metadata.
Return type: data frame

get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) → Dataset¶

Returns dataset metadata.

Examples

>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")

Access using the dataset ID:

>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)

Access using collection and dataset name:

>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)

Parameters

collection_name (str or tuple of str and str) – Name of the dataset collection or the dataset ID (collection, and dataset name).
dataset_name (str, optional) – Name of the dataset or empty.

Returns

dataset – The dataset metadata

Return type

a dataset object

get_collection_names() → List[str]¶: Returns the unique dataset collection names (includes custom datasets if present).

get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) → DataFrame¶

Loads the training/testing time series as a data frame.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

df – The training or testing time series as a pandas.DataFrame.

Return type

data frame

get_dataset_names() → List[str]¶: Returns the unique dataset names (includes custom datasets if present).

get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) → ndarray¶

Loads the training/testing time series as a multi-dimensional array.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

values – The training or testing time series as a multi-dimensional array.

Return type

ndarray

get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) → Path¶

Returns the path to the training/testing time series of the dataset.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
train (bool) – Whether the path to the training (True) or testing (False, default) time series should be returned.

Returns

dataset_path – The path to the training or testing time series.

Return type

path

get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) → DatasetMetadata¶

Computes detailed metadata about the training or testing time series of a dataset.

For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:

Information about the training time series, if train=True is specified.
Mean, variance, trend, and stationarity information for each channel of the time series individually.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be analyzed.

Returns

metadata – Detailed metadata about the training or testing time series.

Return type

dataset metadata object

See also

timeeval.datasets.DatasetAnalyzer: Utility class used for the extraction of metadata.
timeeval.datasets.DatasetMetadata: Data class of the returned result.

get_training_type(dataset_id: Tuple[str, str]) → TrainingType¶

Returns the training type of a specific dataset.

Parameters: dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
Returns: training_type – Either unsupervised, semi-supervised, or supervised.
Return type: TrainingType enum

See also

timeeval.TrainingType: Enumeration of training types that could be returned by this method.

load_custom_datasets(file_path: Union[str, Path]) → None¶

Reads a configuration file that contains additional datasets and adds them to the current dataset index.

You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file is a JSON file with the following structure:

{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10
    }
}

The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.

Warning

Repeated calls to this method overwrite the existing custom dataset list.

Parameters: file_path (path) – Path to the custom dataset configuration file.

abstract refresh(force: bool = False) → None¶: Re-read the benchmark dataset collection information from disk.

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) → List[Tuple[str, str]]¶

Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.

Parameters

collection (str) – restrict datasets to a specific collection
dataset (str) – restrict datasets to a specific name
dataset_type (str) – restrict dataset type (e.g. real or synthetic)
datetime_index (bool) – only select datasets for which a datetime index exists; if True: “timestamp”-column has datetime values; if False: “timestamp”-column has monotonically increasing integer values; this condition is ignored by custom datasets.
training_type (timeeval.TrainingType) – select datasets for specific training needs: * SUPERVISED, * SEMI_SUPERVISED, or * UNSUPERVISED
train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies
input_dimensionality (timeeval.InputDimensionality) – restrict dataset to input type, either univariate or multivariate
min_anomalies (int) – restrict datasets to those with a minimum number of min_anomalies anomalous subsequences
max_anomalies (int) – restrict datasets to those with a maximum number of max_anomalies anomalous subsequences
max_contamination (float) – restrict datasets to those having a contamination less than or equal to max_contamination

Returns

dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).

Return type

List[Tuple[str,str]]

timeeval.datasets.metadata¶

class timeeval.datasets.metadata.AnomalyLength(min: int, median: int, max: int)¶

Bases: object

max: int¶

median: int¶

min: int¶

class timeeval.datasets.metadata.DatasetMetadata(dataset_id: Tuple[str, str], is_train: bool, length: int, dimensions: int, contamination: float, num_anomalies: int, anomaly_length: AnomalyLength, means: Dict[str, float], stddevs: Dict[str, float], trends: Dict[str, List[Trend]], stationarities: Dict[str, Stationarity])¶

Bases: object

Represents the metadata of a single time series of a dataset (for each channel).

anomaly_length: AnomalyLength¶

property channels: int¶

contamination: float¶

dataset_id: Tuple[str, str]¶

dimensions: int¶

static from_json(s: str) → DatasetMetadata¶

get_stationarity_name() → str¶

is_train: bool¶

length: int¶

property mean: float¶

means: Dict[str, float]¶

num_anomalies: int¶

property shape: Tuple[int, int]¶

stationarities: Dict[str, Stationarity]¶

property stationarity: Stationarity¶

property stddev: float¶

stddevs: Dict[str, float]¶

to_json(pretty: bool = False) → str¶

property trend: str¶

trends: Dict[str, List[Trend]]¶

class timeeval.datasets.metadata.DatasetMetadataEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)¶

Bases: JSONEncoder

default(o: Any) → Any¶

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)

static object_hook(dct: Dict[str, Any]) → Any¶

class timeeval.datasets.metadata.Stationarity(value)¶

Bases: Enum

An enumeration.

DIFFERENCE_STATIONARY = 1¶

NOT_STATIONARY = 3¶

STATIONARY = 0¶

TREND_STATIONARY = 2¶

static from_name(s: int) → Stationarity¶
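The four enumeration members correspond to the four possible outcomes of combining a KPSS test with an ADF test. The sketch below follows the conventional interpretation of the two tests; it redefines the enumeration locally with the values listed above, and the function `combine` is a hypothetical helper, so TimeEval's own decision logic may differ in detail.

```python
from enum import Enum


class Stationarity(Enum):  # local copy of the members documented above
    STATIONARY = 0
    DIFFERENCE_STATIONARY = 1
    TREND_STATIONARY = 2
    NOT_STATIONARY = 3


def combine(kpss_stationary: bool, adf_stationary: bool) -> Stationarity:
    """Map the verdicts of both tests to one label (conventional interpretation)."""
    if kpss_stationary and adf_stationary:
        return Stationarity.STATIONARY
    if kpss_stationary:
        # KPSS finds stationarity, ADF does not: stationary around a trend.
        return Stationarity.TREND_STATIONARY
    if adf_stationary:
        # ADF finds stationarity, KPSS does not: stationary after differencing.
        return Stationarity.DIFFERENCE_STATIONARY
    return Stationarity.NOT_STATIONARY


for kpss in (True, False):
    for adf in (True, False):
        print(kpss, adf, combine(kpss, adf).name)
```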

class timeeval.datasets.metadata.Trend(tpe: timeeval.datasets.metadata.TrendType, coef: float, confidence_r2: float)¶

Bases: object

coef: float¶

confidence_r2: float¶

property name: str¶

property order: int¶

tpe: TrendType¶

class timeeval.datasets.metadata.TrendType(value)¶

Bases: Enum

An enumeration.

KUBIC = 3¶

LINEAR = 1¶

QUADRATIC = 2¶

static from_order(order: int) → TrendType¶
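Since each member's value equals the order of the fitted polynomial, a lookup by order can be expressed as a lookup by enum value. The sketch below redefines the enumeration locally with the member names and values listed above; the free function `from_order` is an assumed equivalent of the static method, not its actual source.

```python
from enum import Enum


class TrendType(Enum):  # local copy of the members documented above
    LINEAR = 1
    QUADRATIC = 2
    KUBIC = 3


def from_order(order: int) -> TrendType:
    """Look up the trend type whose value equals the polynomial order."""
    return TrendType(order)  # raises ValueError for an unknown order


print(from_order(1).name, from_order(2).name, from_order(3).name)
```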

timeeval.datasets.multi_dataset_manager¶

class timeeval.datasets.multi_dataset_manager.MultiDatasetManager(data_folders: List[Union[str, Path]], custom_datasets_file: Optional[Union[str, Path]] = None)¶

Bases: Datasets

Provides read-only access to multiple benchmark dataset collections and their meta-information.

Manages dataset collections and their meta-information that are stored in multiple folders. The entries in all index files must be unique across folders and are NOT allowed to overlap, because overlapping entries would lead to information loss!

Parameters

data_folders (list of paths) – List of data paths that hold the datasets and the index files.
custom_datasets_file (path) – Path to a file listing additional custom datasets.

Raises

FileNotFoundError – If the datasets.csv-file was not found in any of the data_folders.

See also

timeeval.datasets.Datasets, timeeval.datasets.DatasetManager
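Before combining several data folders, it can be useful to verify that their index files do not share dataset IDs. The helper `find_overlaps` below is a hypothetical check that operates on sets of IDs you have already collected from each index; it is not a TimeEval function.

```python
from typing import List, Set, Tuple

DatasetId = Tuple[str, str]


def find_overlaps(indices: List[Set[DatasetId]]) -> Set[DatasetId]:
    """Return every dataset ID that occurs in more than one index."""
    seen: Set[DatasetId] = set()
    overlaps: Set[DatasetId] = set()
    for index in indices:
        overlaps |= seen & index  # IDs already present in an earlier index
        seen |= index
    return overlaps


folder_a = {("collection-a", "series-1"), ("collection-a", "series-2")}
folder_b = {("collection-b", "series-1")}
folder_c = {("collection-a", "series-2"), ("collection-c", "series-9")}

print(find_overlaps([folder_a, folder_b]))
print(find_overlaps([folder_a, folder_b, folder_c]))
```

An empty result means the folders can be combined without one entry shadowing another.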

INDEX_FILENAME: str = 'datasets.csv'¶

METADATA_FILENAME_SUFFIX: str = 'metadata.json'¶

df() → DataFrame¶

Returns a copy of the internal dataset metadata collection.

The DataFrame has the following schema:

Index: dataset_name, collection_name
Columns: train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size

Returns: df – All custom and benchmark datasets and their metadata.
Return type: data frame

get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) → Dataset¶

Returns dataset metadata.

Examples

>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")

Access using the dataset ID:

>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)

Access using collection and dataset name:

>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)

Parameters

collection_name (str or tuple of str and str) – Name of the dataset collection or the dataset ID (collection, and dataset name).
dataset_name (str, optional) – Name of the dataset or empty.

Returns

dataset – The dataset metadata

Return type

a dataset object
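
The two call forms accepted by get can be reduced to a single dataset ID. The sketch below shows one way to do this; resolve_dataset_id is a hypothetical helper and not the library's implementation, and raising a ValueError for a missing dataset name is an assumption, since the documentation does not state how that case is handled.

```python
from typing import Optional, Tuple, Union


def resolve_dataset_id(
    collection_name: Union[str, Tuple[str, str]],
    dataset_name: Optional[str] = None,
) -> Tuple[str, str]:
    """Hypothetical helper that normalizes both call forms to a dataset ID."""
    if isinstance(collection_name, tuple):
        # Form 1: a complete dataset ID was passed as the first argument.
        return collection_name
    if dataset_name is None:
        # Assumption: a collection name alone does not identify a dataset.
        raise ValueError("dataset_name is required when collection_name is a string")
    # Form 2: collection and dataset name were passed separately.
    return collection_name, dataset_name


id_from_tuple = resolve_dataset_id(("custom", "dataset1"))
id_from_names = resolve_dataset_id("custom", "dataset1")
print(id_from_tuple == id_from_names)
```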

get_collection_names() → List[str]¶: Returns the unique dataset collection names (includes custom datasets if present).

get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) → DataFrame¶

Loads the training/testing time series as a data frame.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

df – The training or testing time series as a pandas.DataFrame.

Return type

data frame

get_dataset_names() → List[str]¶: Returns the unique dataset names (includes custom datasets if present).

get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) → ndarray¶

Loads the training/testing time series as a multi-dimensional array.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

values – The training or testing time series as a multi-dimensional array.

Return type

ndarray

get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) → Path¶

Returns the path to the training/testing time series of the dataset.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the path to the training (True) or testing (False, default) time series should be returned.

Returns

dataset_path – The path to the training or testing time series.

Return type

path

get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) → DatasetMetadata¶

Computes detailed metadata about the training or testing time series of a dataset.

For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:

Information about the training time series, if train=True is specified.
Mean, variance, trend, and stationarity information for each channel of the time series individually.
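
The load-or-compute behavior described above can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: load_or_compute_metadata and fake_analyzer are hypothetical, the cache file name pattern is assumed, and the metadata dictionary is made up. Only the suffix metadata.json and the rule that custom datasets are not cached come from the documentation.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

METADATA_FILENAME_SUFFIX = "metadata.json"  # as documented for the manager


def load_or_compute_metadata(
    dataset_id: Tuple[str, str],
    cache_dir: Path,
    analyze: Callable[[Tuple[str, str]], Dict],
    custom_collection: str = "custom",
) -> Dict:
    """Hypothetical load-or-compute helper."""
    collection, name = dataset_id
    # Assumption: one metadata file per dataset, named <dataset>.<suffix>.
    cache_file = cache_dir / f"{name}.{METADATA_FILENAME_SUFFIX}"
    is_custom = collection == custom_collection
    if not is_custom and cache_file.exists():
        # Pre-computed metadata only has to be loaded from disk.
        return json.loads(cache_file.read_text())
    metadata = analyze(dataset_id)
    if not is_custom:
        # Only non-custom datasets are written back to disk for later reuse.
        cache_file.write_text(json.dumps(metadata))
    return metadata


calls: List[Tuple[str, str]] = []


def fake_analyzer(dataset_id: Tuple[str, str]) -> Dict:
    # Stand-in for the real analysis; records how often it runs.
    calls.append(dataset_id)
    return {"dataset_id": list(dataset_id), "mean": 0.0, "stddev": 1.0}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    load_or_compute_metadata(("benchmark", "ts1"), cache, fake_analyzer)
    load_or_compute_metadata(("benchmark", "ts1"), cache, fake_analyzer)  # read from disk
    load_or_compute_metadata(("custom", "dataset1"), cache, fake_analyzer)
    load_or_compute_metadata(("custom", "dataset1"), cache, fake_analyzer)  # analyzed again

print(len(calls))
```

In this sketch the analyzer runs once for the benchmark dataset and on every call for the custom dataset.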

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).
train (bool) – Whether the metadata of the training (True) or testing (False, default) time series should be loaded.

Returns

metadata – Detailed metadata about the training or testing time series.

Return type

dataset metadata object

See also

timeeval.datasets.DatasetAnalyzer: Utility class used for the extraction of metadata.
timeeval.datasets.DatasetMetadata: Data class of the returned result.

get_training_type(dataset_id: Tuple[str, str]) → TrainingType¶

Returns the training type of a specific dataset.

Parameters: dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)
Returns: training_type – Either unsupervised, semi-supervised, or supervised.
Return type: TrainingType enum

See also

timeeval.TrainingType: Enumeration of training types that could be returned by this method.

load_custom_datasets(file_path: Union[str, Path]) → None¶

Reads a configuration file that contains additional datasets and adds them to the current dataset index.

You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file is a JSON file with the following structure:

{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10
    }
}

The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.
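
A configuration file with this structure can be generated from Python. The sketch below writes two entries, one with all properties and one with only the required test_path. The helper check_config, the file name custom_datasets.json, and all paths are hypothetical; whether the library performs a similar check is not stated in the documentation.

```python
import json
import tempfile
from pathlib import Path

# Only test_path is required; train_path, type, and period are optional.
custom_datasets = {
    "dataset1": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10,
    },
    "dataset2": {"test_path": "./path/to/other-test.csv"},
}


def check_config(config: dict) -> None:
    """Hypothetical sanity check that mirrors the documented structure."""
    for name, entry in config.items():
        if "test_path" not in entry:
            raise ValueError(f"{name}: missing required property 'test_path'")


check_config(custom_datasets)

with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "custom_datasets.json"  # hypothetical file name
    config_file.write_text(json.dumps(custom_datasets, indent=4))
    # Read the file back to confirm that it is valid JSON.
    reloaded = json.loads(config_file.read_text())

print(sorted(reloaded))
```

The path of such a file is what load_custom_datasets expects as file_path, or what the constructor expects as custom_datasets_file.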

Warning

Repeated calls to this method overwrite the existing custom dataset list.

Parameters: file_path (path) – Path to the custom dataset configuration file.

refresh(force: bool = False) → None¶: Re-read the benchmark dataset collection information from disk.

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) → List[Tuple[str, str]]¶

Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.

Parameters

collection (str) – restrict datasets to a specific collection
dataset (str) – restrict datasets to a specific name
dataset_type (str) – restrict dataset type (e.g. real or synthetic)
datetime_index (bool) – select datasets by the type of their index; if True: only datasets whose “timestamp”-column has datetime values; if False: only datasets whose “timestamp”-column has monotonically increasing integer values; this condition is ignored for custom datasets.
training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED
train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies
input_dimensionality (timeeval.InputDimensionality) – restrict dataset to input type, either univariate or multivariate
min_anomalies (int) – restrict datasets to those with a minimum number of min_anomalies anomalous subsequences
max_anomalies (int) – restrict datasets to those with a maximum number of max_anomalies anomalous subsequences
max_contamination (float) – restrict datasets to those having a contamination less than or equal to max_contamination

Returns

dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).

Return type

List[Tuple[str,str]]
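
The selection semantics (a condition that is left as None is skipped, and all given conditions must hold) can be illustrated on plain dictionaries. This is a standalone sketch and not how the library implements select: select_ids is a hypothetical function that covers only a subset of the parameters, and the records are made up, with field names taken from the df() schema.

```python
from typing import Dict, List, Optional, Tuple

# Made-up index records for illustration.
records: List[Dict] = [
    {"collection_name": "c1", "dataset_name": "a", "dataset_type": "real",
     "num_anomalies": 1, "contamination": 0.01},
    {"collection_name": "c1", "dataset_name": "b", "dataset_type": "synthetic",
     "num_anomalies": 3, "contamination": 0.10},
    {"collection_name": "c2", "dataset_name": "c", "dataset_type": "synthetic",
     "num_anomalies": 2, "contamination": 0.02},
]


def select_ids(
    rows: List[Dict],
    collection: Optional[str] = None,
    dataset_type: Optional[str] = None,
    min_anomalies: Optional[int] = None,
    max_anomalies: Optional[int] = None,
    max_contamination: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """Hypothetical filter: a row is kept only if it passes every given condition."""
    selected = []
    for row in rows:
        checks = [
            # Each condition is skipped (treated as True) when its argument is None.
            collection is None or row["collection_name"] == collection,
            dataset_type is None or row["dataset_type"] == dataset_type,
            min_anomalies is None or row["num_anomalies"] >= min_anomalies,
            max_anomalies is None or row["num_anomalies"] <= max_anomalies,
            max_contamination is None or row["contamination"] <= max_contamination,
        ]
        if all(checks):
            selected.append((row["collection_name"], row["dataset_name"]))
    return selected


synthetic_clean = select_ids(records, dataset_type="synthetic", max_contamination=0.05)
everything = select_ids(records)
print(synthetic_clean, len(everything))
```

Calling the function without any condition returns every dataset ID, which matches the fact that all parameters of select are optional.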