timeeval.datasets package

timeeval.datasets.analyzer

class timeeval.datasets.analyzer.DatasetAnalyzer(dataset_id: Tuple[str, str], is_train: bool, df: Optional[DataFrame] = None, dataset_path: Optional[Path] = None, dmgr: Optional[Datasets] = None, ignore_stationarity: bool = False, ignore_trend: bool = False)

Utility class to analyze a dataset and infer metadata about the dataset.

Use this class to compute necessary metadata from a time series. The computation is started directly when instantiating this class. You can access the results using the property metadata. There are multiple ways to instantiate this class, but you always have to specify the dataset ID, because it is part of the metadata:

  1. Use an existing pandas data frame object. Supply a value to the parameter df.

  2. Use a path to a time series. Supply a value to the parameter dataset_path.

  3. Use a dataset ID and a reference to the dataset manager. Supply a value to the parameter dmgr.

This class computes simple metadata, such as number of anomalies, mean, and standard deviation, as well as advanced metadata, such as trends or stationarity information for all time series channels. The simple metadata is exact, whereas the advanced metadata is estimated based on the observed time series data. The trend is computed by fitting regression models of different orders to the time series. If the regression has a high enough correlation with the observed values, the trends and their confidence are recorded. The stationarity of the time series is estimated using two statistical tests, the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test and the Augmented Dickey-Fuller (ADF) test.
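The distinction between exact and estimated metadata can be illustrated with a small, standard-library-only sketch. It is not the TimeEval implementation: the helper names (anomaly_lengths, simple_metadata, linear_trend_r2), the use of the population standard deviation, and the restriction to a first-order (linear) fit are assumptions made for illustration.

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def anomaly_lengths(labels: List[int]) -> List[int]:
    """Lengths of all contiguous runs of 1s in a 0/1 label sequence."""
    lengths: List[int] = []
    run = 0
    for label in labels:
        if label == 1:
            run += 1
        elif run > 0:
            lengths.append(run)
            run = 0
    if run > 0:  # a run that reaches the end of the series
        lengths.append(run)
    return lengths


def simple_metadata(values: List[float], labels: List[int]) -> Dict[str, float]:
    """Exact statistics for one channel (hypothetical helper, not the TimeEval API)."""
    lengths = anomaly_lengths(labels)
    return {
        "length": len(values),
        "num_anomalies": len(lengths),  # number of anomalous subsequences
        "contamination": sum(labels) / len(labels),
        "median_anomaly_length": median(lengths) if lengths else 0,
        "mean": mean(values),
        "stddev": pstdev(values),  # assumption: population standard deviation
    }


def linear_trend_r2(values: List[float]) -> float:
    """R^2 of a least-squares line fitted against the time index 0..n-1."""
    xs = range(len(values))
    x_mean, y_mean = mean(xs), mean(values)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, values))
    ss_tot = sum((y - y_mean) ** 2 for y in values)
    return 1.0 - ss_res / ss_tot
```

A trend would only be recorded if the R^2 value exceeds some threshold; the threshold used by TimeEval is not stated here, so none is hard-coded in the sketch.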

The metadata of a dataset can be stored to disk. This class provides utility functions to create a JSON-file per dataset, containing the metadata about the test time series and the optional training time series.

Parameters
  • dataset_id (tuple of str, str) – ID of the dataset consisting of collection and dataset name.

  • is_train (bool) – Whether the analyzed time series is the training (True) or testing (False) time series of the dataset.

  • df (data frame, optional) – Time series data frame. If df is supplied, you can omit dataset_path and dmgr.

  • dataset_path (path, optional) – Path to the time series. If dataset_path is supplied, you can omit df and dmgr.

  • dmgr (Datasets object, optional) – Dataset manager instance that is used to load the time series if df and dataset_path are not specified.

  • ignore_stationarity (bool, optional) – Don’t estimate the stationarity of the time series channels. This might be necessary for large datasets, because this step takes a lot of time.

  • ignore_trend (bool, optional) – Don’t estimate the trend type of the time series channels. This might be necessary for large datasets, because this step takes a lot of time.

static load_from_json(filename: Union[str, Path], train: bool = False) DatasetMetadata

Loads existing time series metadata from disk.

If there are multiple metadata entries with the same dataset ID and training/testing-label, the first entry is used.

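Parameters
  • filename (path) – Path to the JSON file that contains the metadata.

  • train (bool) – Whether the metadata of the training (True) or testing (False, default) time series should be loaded.

Returns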

metadata – Metadata of the training or testing time series.

Return type

time series metadata object

property metadata: DatasetMetadata

Returns the computed metadata about the time series.

save_to_json(filename: Union[str, Path], overwrite: bool = False) None

Save the computed metadata for a dataset to disk.

This method writes a dataset’s metadata to a JSON-formatted file on disk. The file contains a list of metadata specifications: one for the test time series and potentially another one for the training time series. Since the DatasetAnalyzer analyzes only a single time series at a time, this method appends the current metadata to the existing list by default. To overwrite the existing content of the file instead, set the parameter overwrite to True.

Parameters
  • filename (path) – Path to the file to which the metadata is written. The file may already exist.

  • overwrite (bool) – Whether existing data in the file should be overwritten (True) or the current metadata should be appended to it (False, default).
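The append-by-default behavior and the "first matching entry wins" rule of load_from_json can be sketched with the standard library alone. The functions save_entry and load_entry and the plain-dict entry layout are hypothetical stand-ins, not the TimeEval implementation; for brevity the lookup matches only on the is_train flag.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def save_entry(filename: Union[str, Path], entry: Dict[str, Any], overwrite: bool = False) -> None:
    """Append ``entry`` to the JSON list in ``filename`` or replace the list."""
    path = Path(filename)
    entries: List[Dict[str, Any]] = []
    if path.exists() and not overwrite:
        # Keep what is already stored and add the new entry at the end.
        entries = json.loads(path.read_text(encoding="utf-8"))
    entries.append(entry)
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")


def load_entry(filename: Union[str, Path], train: bool = False) -> Dict[str, Any]:
    """Return the first entry whose ``is_train`` flag equals ``train``."""
    entries = json.loads(Path(filename).read_text(encoding="utf-8"))
    for entry in entries:
        if entry["is_train"] == train:
            return entry
    raise ValueError(f"No metadata for train={train} in {filename}")
```

Because entries are appended, analyzing the same time series twice without overwrite leaves two entries in the file, and only the older one is returned on load.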

timeeval.datasets.custom

class timeeval.datasets.custom.CDEntry(test_path, train_path, details)

Bases: NamedTuple

details: Dataset

Alias for field number 2

test_path: Path

Alias for field number 0

train_path: Optional[Path]

Alias for field number 1
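The "Alias for field number N" notes are the standard way named tuples document that each field is also reachable by position. A minimal sketch with a hypothetical stand-in class (Entry, using a plain dict where CDEntry stores a Dataset object) shows both access styles:

```python
from pathlib import Path
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """Hypothetical stand-in with the same field order as CDEntry."""

    test_path: Path
    train_path: Optional[Path]
    details: dict  # CDEntry stores a Dataset object here


entry = Entry(Path("test.csv"), None, {"length": 1000})

# Positional and named access refer to the same values.
first_field = entry[0]
test_path, train_path, details = entry  # tuple unpacking follows field order
```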

class timeeval.datasets.custom.CustomDatasets(dataset_config: Union[str, Path])

Bases: CustomDatasetsBase

Implementation of the custom datasets API.

Internal API! You should not need to use or modify this class.

This class behaves similarly to the timeeval.datasets.datasets.Datasets-API while using a different internal representation for the dataset index.

get(dataset_name: str) Dataset
get_collection_names() List[str]
get_dataset_names() List[str]
get_path(dataset_name: str, train: bool) Path
select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]]

timeeval.datasets.custom_base

class timeeval.datasets.custom_base.CustomDatasetsBase

Bases: ABC

API definition for custom datasets.

Internal API! You should not need to use or modify this class.

abstract get(dataset_name: str) Dataset
abstract get_collection_names() List[str]
abstract get_dataset_names() List[str]
abstract get_path(dataset_name: str, train: bool) Path
abstract select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]]

timeeval.datasets.custom_noop

class timeeval.datasets.custom_noop.NoOpCustomDatasets

Bases: CustomDatasetsBase

Dummy implementation of the CustomDatasets interface.

Internal API! You should not need to use or modify this class.

This dummy implementation does nothing and improves readability of the timeeval.datasets.datasets.Datasets-implementation by removing the need for None-checks.

get(dataset_name: str) Dataset
get_collection_names() List[str]
get_dataset_names() List[str]
get_path(dataset_name: str, train: bool) Path
select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]]
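The idea of a no-op implementation that removes None-checks (often called the null object pattern) can be sketched as follows. All class names here (NamesSource, FileBackedSource, NoOpSource, Manager) are hypothetical and the interface is reduced to a single method; the real base class declares more.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class NamesSource(ABC):
    """Hypothetical, reduced interface (the real base class has more methods)."""

    @abstractmethod
    def get_dataset_names(self) -> List[str]:
        ...


class FileBackedSource(NamesSource):
    """Source that knows about some custom datasets."""

    def __init__(self, index: Dict[str, dict]) -> None:
        self._index = index

    def get_dataset_names(self) -> List[str]:
        return list(self._index)


class NoOpSource(NamesSource):
    """Source that knows about no datasets at all."""

    def get_dataset_names(self) -> List[str]:
        return []


class Manager:
    def __init__(self, benchmark_names: List[str], custom: Optional[NamesSource] = None) -> None:
        self._benchmark_names = benchmark_names
        # Fall back to the no-op source once, so no method needs a None-check.
        self._custom: NamesSource = custom if custom is not None else NoOpSource()

    def get_dataset_names(self) -> List[str]:
        return self._benchmark_names + self._custom.get_dataset_names()
```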

timeeval.datasets.dataset

class timeeval.datasets.dataset.Dataset(datasetId: Tuple[str, str], dataset_type: str, training_type: TrainingType, length: int, dimensions: int, contamination: float, min_anomaly_length: int, median_anomaly_length: int, max_anomaly_length: int, period_size: Optional[int] = None, num_anomalies: Optional[int] = None)

Bases: object

Dataset information containing basic metadata about the dataset.

This class is used within TimeEval heuristics to determine the heuristic values based on the dataset properties.

property collection_name: str
contamination: float
datasetId: Tuple[str, str]
dataset_type: str
dimensions: int
property has_anomalies: Optional[bool]
property input_dimensionality: InputDimensionality
length: int
max_anomaly_length: int
median_anomaly_length: int
min_anomaly_length: int
property name: str
num_anomalies: Optional[int] = None
period_size: Optional[int] = None
training_type: TrainingType

timeeval.datasets.dataset_manager

class timeeval.datasets.dataset_manager.DatasetManager(data_folder: Union[str, Path], custom_datasets_file: Optional[Union[str, Path]] = None, create_if_missing: bool = True)

Bases: ContextManager[DatasetManager], Datasets

Manages benchmark datasets and their meta-information.

Manages dataset collections and their meta-information that are stored in a single folder with an index file. You can also use this class to create a new TimeEval dataset collection.

Warning

ATTENTION: Not multi-processing-safe! There is no check for changes to the underlying datasets.csv file while this class is loaded.

Read-only access is fine with multiple processes.

Parameters
  • data_folder (path) – Path to the folder where the benchmark data is stored. This folder contains the file datasets.csv and the datasets in a hierarchical storage layout.

  • custom_datasets_file (path) – Path to a file listing additional custom datasets.

  • create_if_missing (bool) – Create an index-file in the data_folder if none could be found. Set this to False if an exception should be raised if the folder is wrong or does not exist.

Raises

FileNotFoundError – If create_if_missing is set to False and no datasets.csv-file was found in the data_folder.

INDEX_FILENAME: str = 'datasets.csv'
METADATA_FILENAME_SUFFIX: str = 'metadata.json'
add_dataset(dataset: DatasetRecord) None

Adds a new dataset to the benchmark dataset collection (in-memory).

The provided dataset metadata is added to this dataset collection (to the in-memory index). You can save the in-memory index to disk using the timeeval.datasets.DatasetManager.save()-method. The referenced time series files (training and testing paths) are not touched. If the dataset ID (collection_name, dataset_name) matches that of an existing dataset, the existing entries are overwritten!

Parameters

dataset (DatasetRecord object) – The dataset information to add to the benchmark collection.
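The overwrite-on-duplicate-ID behavior corresponds to keying an in-memory index by the dataset ID. The following sketch uses hypothetical stand-ins (Record with only four fields, InMemoryIndex) and is not how DatasetManager stores its index internally:

```python
from typing import Dict, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical, shortened stand-in for DatasetRecord."""

    collection_name: str
    dataset_name: str
    test_path: str
    length: int


class InMemoryIndex:
    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], Record] = {}

    def add_dataset(self, record: Record) -> None:
        key = (record.collection_name, record.dataset_name)
        # An existing entry with the same ID is replaced without warning.
        self._records[key] = record

    def get(self, dataset_id: Tuple[str, str]) -> Record:
        return self._records[dataset_id]

    def __len__(self) -> int:
        return len(self._records)


index = InMemoryIndex()
index.add_dataset(Record("custom", "dataset1", "old/test.csv", 100))
index.add_dataset(Record("custom", "dataset1", "new/test.csv", 250))
```

After both calls the index holds a single entry for ("custom", "dataset1"), and it is the second record.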

add_datasets(datasets: List[DatasetRecord]) None

Add a list of datasets to the dataset collection.

Add a list of new datasets to the benchmark dataset collection (in-memory). Already existing keys are overwritten!

Parameters

datasets (list of DatasetRecord objects) – List of dataset metadata to add to this dataset collection.

df() DataFrame

Returns a copy of the internal dataset metadata collection.

The DataFrame has the following schema:

Index:

dataset_name, collection_name

Columns:

train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size

Returns

df – All custom and benchmark datasets and their metadata.

Return type

data frame

get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) Dataset

Returns dataset metadata.

Examples

>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")

Access using the dataset ID:

>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)

Access using collection and dataset name:

>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)

Parameters
  • collection_name (str or tuple of str and str) – Name of the dataset collection or the dataset ID (collection and dataset name).

  • dataset_name (str, optional) – Name of the dataset or empty.

Returns

dataset – The dataset metadata

Return type

a dataset object
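Accepting either a dataset ID tuple or two separate strings amounts to normalizing the arguments into one tuple first. The helper normalize_dataset_id below is hypothetical, and raising a ValueError for a missing dataset name is an assumption; the actual error handling of get() is not documented here.

```python
from typing import Optional, Tuple, Union


def normalize_dataset_id(
    collection_name: Union[str, Tuple[str, str]],
    dataset_name: Optional[str] = None,
) -> Tuple[str, str]:
    """Turn both supported call styles into a (collection, dataset) tuple."""
    if isinstance(collection_name, tuple):
        # Called with a full dataset ID; dataset_name is not needed.
        return collection_name
    if dataset_name is None:
        raise ValueError("dataset_name is required if no dataset ID tuple is given")
    return (collection_name, dataset_name)
```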

get_collection_names() List[str]

Returns the unique dataset collection names (includes custom datasets if present).

get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) DataFrame

Loads the training/testing time series as a data frame.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

df – The training or testing time series as a pandas.DataFrame.

Return type

data frame

get_dataset_names() List[str]

Returns the unique dataset names (includes custom datasets if present).

get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) ndarray

Loads the training/testing time series as a multi-dimensional array.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

values – The training or testing time series as a multi-dimensional array.

Return type

ndarray

get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) Path

Returns the path to the training/testing time series of the dataset.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)

  • train (bool) – Whether the path to the training (True) or testing (False, default) time series should be returned.

Returns

dataset_path – The path to the training or testing time series.

Return type

path

get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) DatasetMetadata

Computes detailed metadata about the training or testing time series of a dataset.

For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:

  • Information about the training time series, if train=True is specified.

  • Mean, variance, trend, and stationarity information for each channel of the time series individually.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

metadata – Detailed metadata about the training or testing time series.

Return type

dataset metadata object

See also

timeeval.datasets.DatasetAnalyzer

Utility class used for the extraction of metadata.

timeeval.datasets.DatasetMetadata

Data class of the returned result.

get_training_type(dataset_id: Tuple[str, str]) TrainingType

Returns the training type of a specific dataset.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)

Returns

training_type – Either unsupervised, semi-supervised, or supervised.

Return type

TrainingType enum

See also

timeeval.TrainingType

Enumeration of training types that could be returned by this method.

load_custom_datasets(file_path: Union[str, Path]) None

Reads a configuration file that contains additional datasets and adds them to the current dataset index.

You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file uses the JSON format and supports this structure:

{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10
    }
}

The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.

Warning

Repeated calls to this method overwrite the existing custom dataset list.

Parameters

file_path (path) – Path to the custom dataset configuration file.
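A configuration file with the structure above can be generated and sanity-checked with the standard library before it is handed to load_custom_datasets() or to the custom_datasets_file constructor argument. The helper check_config, the dataset names, and the file name custom-datasets.json are illustrative assumptions; the check only enforces that test_path is present, since the other properties are documented as optional.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

config: Dict[str, Dict[str, Any]] = {
    "dataset1": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10,
    },
    # Only test_path is given; the optional properties are left out.
    "dataset2": {"test_path": "./path/to/other-test.csv"},
}


def check_config(cfg: Dict[str, Dict[str, Any]]) -> None:
    """Hypothetical pre-check: every dataset needs at least a test_path."""
    for name, entry in cfg.items():
        if "test_path" not in entry:
            raise ValueError(f"Dataset '{name}' is missing the required 'test_path'")


check_config(config)
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "custom-datasets.json"
    config_file.write_text(json.dumps(config, indent=4), encoding="utf-8")
    # Reading the file back yields the same structure that was written.
    loaded = json.loads(config_file.read_text(encoding="utf-8"))
```

Since a Python dict cannot hold duplicate keys, building the configuration this way also guarantees unique dataset names within the file.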

refresh(force: bool = False) None

Re-read the benchmark dataset collection information from disk.

save() None

Saves the in-memory dataset index to disk.

Persists newly added benchmark datasets from memory to the benchmark dataset collection file datasets.csv. Custom datasets are excluded from persistence and cannot be saved to disk; use add_dataset() or add_datasets() to add datasets to the benchmark dataset collection.

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]]

Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.

Parameters
  • collection (str) – restrict datasets to a specific collection

  • dataset (str) – restrict datasets to a specific name

  • dataset_type (str) – restrict dataset type (e.g. real or synthetic)

  • datetime_index (bool) – only select datasets for which a datetime index exists; if True: “timestamp”-column has datetime values; if False: “timestamp”-column has monotonically increasing integer values; this condition is ignored by custom datasets.

  • training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED

  • train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies

  • input_dimensionality (timeeval.InputDimensionality) – restrict dataset to input type, either univariate or multivariate

  • min_anomalies (int) – restrict datasets to those with a minimum number of min_anomalies anomalous subsequences

  • max_anomalies (int) – restrict datasets to those with a maximum number of max_anomalies anomalous subsequences

  • max_contamination (float) – restrict datasets to those having a contamination less than or equal to max_contamination

Returns

dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).

Return type

List[Tuple[str,str]]
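The matching rule (all given conditions must hold, and a condition left at None does not restrict anything) can be sketched over plain dictionaries. This is a reduced, hypothetical re-implementation with only five of the filters and made-up records; it is not the code behind select().

```python
from typing import Any, Dict, List, Optional, Tuple


def select(
    records: List[Dict[str, Any]],
    collection: Optional[str] = None,
    dataset_type: Optional[str] = None,
    min_anomalies: Optional[int] = None,
    max_anomalies: Optional[int] = None,
    max_contamination: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """Return the IDs of all records that satisfy every given condition."""
    selected: List[Tuple[str, str]] = []
    for r in records:
        checks = [
            collection is None or r["collection_name"] == collection,
            dataset_type is None or r["dataset_type"] == dataset_type,
            min_anomalies is None or r["num_anomalies"] >= min_anomalies,
            max_anomalies is None or r["num_anomalies"] <= max_anomalies,
            max_contamination is None or r["contamination"] <= max_contamination,
        ]
        if all(checks):  # conditions are combined with a logical AND
            selected.append((r["collection_name"], r["dataset_name"]))
    return selected


# Made-up records for illustration only.
records = [
    {"collection_name": "A", "dataset_name": "d1", "dataset_type": "real",
     "num_anomalies": 1, "contamination": 0.01},
    {"collection_name": "A", "dataset_name": "d2", "dataset_type": "synthetic",
     "num_anomalies": 5, "contamination": 0.20},
    {"collection_name": "B", "dataset_name": "d3", "dataset_type": "real",
     "num_anomalies": 3, "contamination": 0.05},
]
```

Calling select(records) without any condition returns all three IDs, while adding conditions can only shrink the result.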

class timeeval.datasets.dataset_manager.DatasetRecord(collection_name, dataset_name, train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size)

Bases: NamedTuple

collection_name: str

Alias for field number 0

contamination: float

Alias for field number 12

count(value, /)

Return number of occurrences of value.

dataset_name: str

Alias for field number 1

dataset_type: str

Alias for field number 4

datetime_index: bool

Alias for field number 5

dimensions: int

Alias for field number 11

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

input_type: str

Alias for field number 9

length: int

Alias for field number 10

max_anomaly_length: int

Alias for field number 16

mean: float

Alias for field number 17

median_anomaly_length: int

Alias for field number 15

min_anomaly_length: int

Alias for field number 14

num_anomalies: int

Alias for field number 13

period_size: Optional[int]

Alias for field number 21

split_at: int

Alias for field number 6

stationarity: str

Alias for field number 20

stddev: float

Alias for field number 18

test_path: str

Alias for field number 3

train_is_normal: bool

Alias for field number 8

train_path: Optional[str]

Alias for field number 2

train_type: str

Alias for field number 7

trend: str

Alias for field number 19

timeeval.datasets.datasets

class timeeval.datasets.datasets.Datasets(df: DataFrame, custom_datasets_file: Optional[Union[str, Path]] = None)

Bases: ABC

Provides read-only access to benchmark datasets and their metadata.

This is an abstract class (interface). Please use timeeval.datasets.dataset_manager.DatasetManager or timeeval.datasets.multi_dataset_manager.MultiDatasetManager instead. The constructor arguments are filled in by the respective implementation.

Parameters
  • df (pandas DataFrame) – Metadata of all loaded datasets.

  • custom_datasets_file (pathlib.Path or str) – Path to a file listing additional custom datasets.

INDEX_FILENAME: str = 'datasets.csv'
METADATA_FILENAME_SUFFIX: str = 'metadata.json'
df() DataFrame

Returns a copy of the internal dataset metadata collection.

The DataFrame has the following schema:

Index:

dataset_name, collection_name

Columns:

train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size

Returns

df – All custom and benchmark datasets and their metadata.

Return type

data frame

get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) Dataset

Returns dataset metadata.

Examples

>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")

Access using the dataset ID:

>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)

Access using collection and dataset name:

>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)

Parameters
  • collection_name (str or tuple of str and str) – Name of the dataset collection or the dataset ID (collection and dataset name).

  • dataset_name (str, optional) – Name of the dataset or empty.

Returns

dataset – The dataset metadata

Return type

a dataset object

get_collection_names() List[str]

Returns the unique dataset collection names (includes custom datasets if present).

get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) DataFrame

Loads the training/testing time series as a data frame.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

df – The training or testing time series as a pandas.DataFrame.

Return type

data frame

get_dataset_names() List[str]

Returns the unique dataset names (includes custom datasets if present).

get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) ndarray

Loads the training/testing time series as a multi-dimensional array.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

values – The training or testing time series as a multi-dimensional array.

Return type

ndarray

get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) Path

Returns the path to the training/testing time series of the dataset.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)

  • train (bool) – Whether the path to the training (True) or testing (False, default) time series should be returned.

Returns

dataset_path – The path to the training or testing time series.

Return type

path

get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) DatasetMetadata

Computes detailed metadata about the training or testing time series of a dataset.

For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:

  • Information about the training time series, if train=True is specified.

  • Mean, variance, trend, and stationarity information for each channel of the time series individually.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

metadata – Detailed metadata about the training or testing time series.

Return type

dataset metadata object

See also

timeeval.datasets.DatasetAnalyzer

Utility class used for the extraction of metadata.

timeeval.datasets.DatasetMetadata

Data class of the returned result.

get_training_type(dataset_id: Tuple[str, str]) TrainingType

Returns the training type of a specific dataset.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name)

Returns

training_type – Either unsupervised, semi-supervised, or supervised.

Return type

TrainingType enum

See also

timeeval.TrainingType

Enumeration of training types that could be returned by this method.

load_custom_datasets(file_path: Union[str, Path]) None

Reads a configuration file that contains additional datasets and adds them to the current dataset index.

You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file uses the JSON format and supports this structure:

{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10
    }
}

The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.

Warning

Repeated calls to this method overwrite the existing custom dataset list.

Parameters

file_path (path) – Path to the custom dataset configuration file.

abstract refresh(force: bool = False) None

Re-read the benchmark dataset collection information from disk.

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]]

Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.

Parameters
  • collection (str) – restrict datasets to a specific collection

  • dataset (str) – restrict datasets to a specific name

  • dataset_type (str) – restrict dataset type (e.g. real or synthetic)

  • datetime_index (bool) – only select datasets for which a datetime index exists; if True: “timestamp”-column has datetime values; if False: “timestamp”-column has monotonically increasing integer values; this condition is ignored by custom datasets.

  • training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED

  • train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies

  • input_dimensionality (timeeval.InputDimensionality) – restrict dataset to input type, either univariate or multivariate

  • min_anomalies (int) – restrict datasets to those with a minimum number of min_anomalies anomalous subsequences

  • max_anomalies (int) – restrict datasets to those with a maximum number of max_anomalies anomalous subsequences

  • max_contamination (float) – restrict datasets to those having a contamination less than or equal to max_contamination

Returns

dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).

Return type

List[Tuple[str,str]]

timeeval.datasets.metadata

class timeeval.datasets.metadata.AnomalyLength(min: int, median: int, max: int)

Bases: object

max: int
median: int
min: int
class timeeval.datasets.metadata.DatasetMetadata(dataset_id: Tuple[str, str], is_train: bool, length: int, dimensions: int, contamination: float, num_anomalies: int, anomaly_length: AnomalyLength, means: Dict[str, float], stddevs: Dict[str, float], trends: Dict[str, List[Trend]], stationarities: Dict[str, Stationarity])

Bases: object

Represents the metadata of a single time series of a dataset (for each channel).

anomaly_length: AnomalyLength
property channels: int
contamination: float
dataset_id: Tuple[str, str]
dimensions: int
static from_json(s: str) DatasetMetadata
get_stationarity_name() str
is_train: bool
length: int
property mean: float
means: Dict[str, float]
num_anomalies: int
property shape: Tuple[int, int]
stationarities: Dict[str, Stationarity]
property stationarity: Stationarity
property stddev: float
stddevs: Dict[str, float]
to_json(pretty: bool = False) str
property trend: str
trends: Dict[str, List[Trend]]
class timeeval.datasets.metadata.DatasetMetadataEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: JSONEncoder

default(o: Any) Any

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
static object_hook(dct: Dict[str, Any]) Any
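How a JSONEncoder subclass and a matching object_hook cooperate can be shown with a small round trip. The enum Kind, the encoder KindEncoder, and the "__kind__" marker key are hypothetical; the actual wire format used by DatasetMetadataEncoder is not documented here and may differ.

```python
import json
from enum import Enum
from typing import Any, Dict


class Kind(Enum):
    """Hypothetical enum standing in for a metadata enum."""

    STATIONARY = 0
    NOT_STATIONARY = 3


class KindEncoder(json.JSONEncoder):
    def default(self, o: Any) -> Any:
        # Called only for objects that json cannot serialize natively.
        if isinstance(o, Kind):
            return {"__kind__": o.name}
        return super().default(o)  # raises TypeError for unknown types

    @staticmethod
    def object_hook(dct: Dict[str, Any]) -> Any:
        # Called for every decoded JSON object, innermost objects first.
        if "__kind__" in dct:
            return Kind[dct["__kind__"]]
        return dct


text = json.dumps({"value": Kind.STATIONARY}, cls=KindEncoder)
restored = json.loads(text, object_hook=KindEncoder.object_hook)
```

Keeping the decoding hook next to the encoder, as the static method does, makes it harder for the two directions to drift apart.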
class timeeval.datasets.metadata.Stationarity(value)

Bases: Enum

An enumeration.

DIFFERENCE_STATIONARY = 1
NOT_STATIONARY = 3
STATIONARY = 0
TREND_STATIONARY = 2
static from_name(s: int) Stationarity
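The four enumeration members line up with the common way of interpreting an ADF and a KPSS test together. The sketch below follows that common interpretation; whether TimeEval applies exactly this decision rule, and the significance level of 0.05, are assumptions. The enum is re-declared locally with the values listed above so that the block is self-contained, and combine_tests is a hypothetical helper that expects the two p-values as input.

```python
from enum import Enum


class Stationarity(Enum):
    """Local copy of the enum values listed above."""

    STATIONARY = 0
    DIFFERENCE_STATIONARY = 1
    TREND_STATIONARY = 2
    NOT_STATIONARY = 3


def combine_tests(adf_p: float, kpss_p: float, alpha: float = 0.05) -> Stationarity:
    """Map the p-values of both tests to one stationarity label."""
    # ADF null hypothesis: unit root present (non-stationary).
    adf_stationary = adf_p < alpha
    # KPSS null hypothesis: series is stationary.
    kpss_stationary = kpss_p >= alpha
    if adf_stationary and kpss_stationary:
        return Stationarity.STATIONARY
    if adf_stationary:
        # Only ADF indicates stationarity: differencing is needed.
        return Stationarity.DIFFERENCE_STATIONARY
    if kpss_stationary:
        # Only KPSS indicates stationarity: removing the trend is needed.
        return Stationarity.TREND_STATIONARY
    return Stationarity.NOT_STATIONARY
```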
class timeeval.datasets.metadata.Trend(tpe: timeeval.datasets.metadata.TrendType, coef: float, confidence_r2: float)

Bases: object

coef: float
confidence_r2: float
property name: str
property order: int
tpe: TrendType
class timeeval.datasets.metadata.TrendType(value)

Bases: Enum

An enumeration.

KUBIC = 3
LINEAR = 1
QUADRATIC = 2
static from_order(order: int) TrendType

timeeval.datasets.multi_dataset_manager

class timeeval.datasets.multi_dataset_manager.MultiDatasetManager(data_folders: List[Union[str, Path]], custom_datasets_file: Optional[Union[str, Path]] = None)

Bases: Datasets

Provides read-only access to multiple benchmark dataset collections and their meta-information.

Manages dataset collections and their meta-information that are stored in multiple folders. The entries in all index files must be unique and are NOT allowed to overlap, because overlapping entries would lead to information loss!

Parameters
  • data_folders (list of paths) – List of data paths that hold the datasets and the index files.

  • custom_datasets_file (path) – Path to a file listing additional custom datasets.

Raises

FileNotFoundError – If the datasets.csv-file was not found in any of the data_folders.

See also

timeeval.datasets.Datasets, timeeval.datasets.DatasetManager
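Because overlapping index entries lose information, it can be useful to check for duplicate dataset IDs before combining several folders. The sketch below works on plain dictionaries keyed by dataset ID; the function merge_indices and the choice to raise a ValueError are assumptions for illustration and do not describe what MultiDatasetManager does internally.

```python
from typing import Any, Dict, List, Tuple

Index = Dict[Tuple[str, str], Dict[str, Any]]


def merge_indices(indices: List[Index]) -> Index:
    """Merge several dataset indices and refuse overlapping dataset IDs."""
    merged: Index = {}
    for index in indices:
        overlap = merged.keys() & index.keys()
        if overlap:
            # A plain dict.update() would silently drop the earlier entries.
            raise ValueError(f"Overlapping dataset IDs: {sorted(overlap)}")
        merged.update(index)
    return merged


folder_a: Index = {("A", "d1"): {"length": 100}}
folder_b: Index = {("B", "d1"): {"length": 200}}
combined = merge_indices([folder_a, folder_b])
```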

INDEX_FILENAME: str = 'datasets.csv'
METADATA_FILENAME_SUFFIX: str = 'metadata.json'
df() DataFrame

Returns a copy of the internal dataset metadata collection.

The DataFrame has the following schema:

Index:

dataset_name, collection_name

Columns:

train_path, test_path, dataset_type, datetime_index, split_at, train_type, train_is_normal, input_type, length, dimensions, contamination, num_anomalies, min_anomaly_length, median_anomaly_length, max_anomaly_length, mean, stddev, trend, stationarity, period_size

Returns

df – All custom and benchmark datasets and their metadata.

Return type

data frame

get(collection_name: Union[str, Tuple[str, str]], dataset_name: Optional[str] = None) Dataset

Returns dataset metadata.

Examples

>>> from timeeval.datasets import DatasetManager
>>> dm = DatasetManager("path/to/datasets")
>>> dataset_id = ("custom", "dataset1")

Access using the dataset ID:

>>> dm.get(dataset_id)
Dataset(datasetId=("custom", "dataset1"), ...)

Access using collection and dataset name:

>>> dm.get("custom", "dataset1")
Dataset(datasetId=("custom", "dataset1"), ...)

Parameters
  • collection_name (str or tuple of str and str) – Name of the dataset collection or the dataset ID (collection and dataset name).

  • dataset_name (str, optional) – Name of the dataset. Omit it when a dataset ID is passed as collection_name.

Returns

dataset – The dataset metadata.

Return type

a dataset object
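
The two call forms shown in the examples can be thought of as one normalization step that always yields a dataset ID. The sketch below is a hypothetical stand-in (resolve_dataset_id is not a library function), and the error raised for a missing dataset name is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple, Union


def resolve_dataset_id(
    collection_name: Union[str, Tuple[str, str]],
    dataset_name: Optional[str] = None,
) -> Tuple[str, str]:
    """Normalize both call forms to a (collection, dataset) tuple."""
    if isinstance(collection_name, tuple):
        # First form: the dataset ID was passed directly.
        return collection_name
    if dataset_name is None:
        # Assumption: a bare collection name is not enough to pick a dataset.
        raise ValueError("dataset_name is required when no dataset ID is given")
    # Second form: collection and dataset name were passed separately.
    return (collection_name, dataset_name)
```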

get_collection_names() List[str]

Returns the unique dataset collection names (includes custom datasets if present).

get_dataset_df(dataset_id: Tuple[str, str], train: bool = False) DataFrame

Loads the training/testing time series as a data frame.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

df – The training or testing time series as a pandas.DataFrame.

Return type

data frame

get_dataset_names() List[str]

Returns the unique dataset names (includes custom datasets if present).

get_dataset_ndarray(dataset_id: Tuple[str, str], train: bool = False) ndarray

Loads the training/testing time series as a multi-dimensional array.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be loaded.

Returns

values – The training or testing time series as a multi-dimensional array.

Return type

ndarray

get_dataset_path(dataset_id: Tuple[str, str], train: bool = False) Path

Returns the path to the training/testing time series of the dataset.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the path to the training (True) or testing (False, default) time series should be returned.

Returns

dataset_path – The path to the training or testing time series.

Return type

path

get_detailed_metadata(dataset_id: Tuple[str, str], train: bool = False) DatasetMetadata

Computes detailed metadata about the training or testing time series of a dataset.

For most of the benchmark datasets, the detailed metadata is pre-computed and just has to be loaded from disk. For all other datasets, the time series is analyzed on the fly using timeeval.datasets.DatasetAnalyzer and the result is saved back to disk for later reuse. The metadata about custom datasets is not cached on disk! The following additional metadata is provided:

  • Information about the training time series, if train=True is specified.

  • Mean, variance, trend, and stationarity information for each channel of the time series individually.

Parameters
  • dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

  • train (bool) – Whether the training (True) or testing (False, default) time series should be analyzed.

Returns

metadata – Detailed metadata about the training or testing time series.

Return type

dataset metadata object

See also

timeeval.datasets.DatasetAnalyzer

Utility class used for the extraction of metadata.

timeeval.datasets.DatasetMetadata

Data class of the returned result.

get_training_type(dataset_id: Tuple[str, str]) TrainingType

Returns the training type of a specific dataset.

Parameters

dataset_id (tuple of str, str) – Dataset ID (collection and dataset name).

Returns

training_type – Either unsupervised, semi-supervised, or supervised.

Return type

TrainingType enum

See also

timeeval.TrainingType

Enumeration of training types that could be returned by this method.

load_custom_datasets(file_path: Union[str, Path]) None

Reads a configuration file that contains additional datasets and adds them to the current dataset index.

You can add custom datasets to the dataset manager either using a constructor argument or using this method. The datasets from the configuration file are added to the internal dataset index and are then available for querying. The configuration file uses the JSON format and supports this structure:

{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "train_path": "./path/to/train.csv",
        "type": "synthetic",
        "period": 10
    }
}

The properties train_path, type, and period are optional. Dataset names must be unique within the configuration file. The datasets are automatically assigned to the custom dataset collection.

Warning

Repeated calls to this method overwrite the existing custom dataset list.

Parameters

file_path (path) – Path to the custom dataset configuration file.
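
Python's json module silently keeps only the last value when an object contains the same key twice, so a duplicate dataset name would go unnoticed when the file is parsed the usual way. The sketch below checks a configuration before it is passed to load_custom_datasets. Both functions are hypothetical helpers, not library API, and the rule that test_path is required is inferred from the list of optional properties above.

```python
import json
from typing import Any, Dict, List, Tuple


def reject_duplicate_keys(pairs: List[Tuple[str, Any]]) -> Dict[str, Any]:
    """object_pairs_hook that fails on repeated keys instead of overwriting them."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate keys in configuration: {duplicates}")
    return dict(pairs)


def parse_custom_datasets(text: str) -> Dict[str, Dict[str, Any]]:
    """Parse a custom dataset configuration and run basic sanity checks."""
    config = json.loads(text, object_pairs_hook=reject_duplicate_keys)
    for name, entry in config.items():
        # Assumption: test_path is the only mandatory property.
        if "test_path" not in entry:
            raise ValueError(f"dataset {name!r} is missing 'test_path'")
    return config


example = """
{
    "dataset_name": {
        "test_path": "./path/to/test.csv",
        "period": 10
    }
}
"""
config = parse_custom_datasets(example)
```

The hook is applied to every JSON object, so repeated property names inside a single dataset entry are rejected as well.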

refresh(force: bool = False) None

Re-read the benchmark dataset collection information from disk.

select(collection: Optional[str] = None, dataset: Optional[str] = None, dataset_type: Optional[str] = None, datetime_index: Optional[bool] = None, training_type: Optional[TrainingType] = None, train_is_normal: Optional[bool] = None, input_dimensionality: Optional[InputDimensionality] = None, min_anomalies: Optional[int] = None, max_anomalies: Optional[int] = None, max_contamination: Optional[float] = None) List[Tuple[str, str]]

Returns a list of dataset identifiers from the dataset collection whose datasets match all of the given conditions.

Parameters
  • collection (str) – restrict datasets to a specific collection

  • dataset (str) – restrict datasets to a specific name

  • dataset_type (str) – restrict dataset type (e.g. real or synthetic)

  • datetime_index (bool) – only select datasets for which a datetime index exists; if True: “timestamp”-column has datetime values; if False: “timestamp”-column has monotonically increasing integer values; this condition is ignored for custom datasets.

  • training_type (timeeval.TrainingType) – select datasets for specific training needs: SUPERVISED, SEMI_SUPERVISED, or UNSUPERVISED

  • train_is_normal (bool) – if True: only return datasets for which the training dataset does not contain anomalies; if False: only return datasets for which the training dataset contains anomalies

  • input_dimensionality (timeeval.InputDimensionality) – restrict dataset to input type, either univariate or multivariate

  • min_anomalies (int) – restrict datasets to those with at least min_anomalies anomalous subsequences

  • max_anomalies (int) – restrict datasets to those with at most max_anomalies anomalous subsequences

  • max_contamination (float) – restrict datasets to those having a contamination less than or equal to max_contamination

Returns

dataset_names – A list of dataset identifiers (tuple of collection name and dataset name).

Return type

List[Tuple[str,str]]
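
The way the conditions combine can be sketched with a stand-in filter over plain dictionaries. This is an illustration only: select_ids is a hypothetical function covering a subset of the parameters, the sample rows are made up, and the sketch assumes that a condition left at None places no restriction on the result.

```python
from typing import Any, Dict, List, Optional, Tuple


def select_ids(
    index: List[Dict[str, Any]],
    collection: Optional[str] = None,
    dataset_type: Optional[str] = None,
    min_anomalies: Optional[int] = None,
    max_contamination: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """Return the IDs of all rows that satisfy every condition that is not None."""
    conditions = [
        (collection, lambda row: row["collection_name"] == collection),
        (dataset_type, lambda row: row["dataset_type"] == dataset_type),
        (min_anomalies, lambda row: row["num_anomalies"] >= min_anomalies),
        (max_contamination, lambda row: row["contamination"] <= max_contamination),
    ]
    # Only conditions with a value take part in the conjunction.
    active = [check for value, check in conditions if value is not None]
    return [
        (row["collection_name"], row["dataset_name"])
        for row in index
        if all(check(row) for check in active)
    ]


# Made-up index rows for illustration.
index = [
    {"collection_name": "custom", "dataset_name": "dataset1",
     "dataset_type": "synthetic", "num_anomalies": 3, "contamination": 0.02},
    {"collection_name": "custom", "dataset_name": "dataset2",
     "dataset_type": "real", "num_anomalies": 1, "contamination": 0.20},
    {"collection_name": "other", "dataset_name": "dataset3",
     "dataset_type": "synthetic", "num_anomalies": 5, "contamination": 0.05},
]
low_contamination = select_ids(index, collection="custom", max_contamination=0.1)
```

Calling the stand-in without any condition returns every dataset ID in the index, while each additional condition can only narrow the result.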