STUMPY API#

Overview

stumpy.stump

Compute the z-normalized matrix profile

stumpy.stumped

Compute the z-normalized matrix profile with a distributed dask/ray cluster

stumpy.gpu_stump

Compute the z-normalized matrix profile with one or more GPU devices

stumpy.mass

Compute the distance profile using the MASS algorithm

stumpy.scrump

Compute an approximate z-normalized matrix profile

stumpy.stumpi

Compute an incremental z-normalized matrix profile for streaming data

stumpy.mstump

Compute the multi-dimensional z-normalized matrix profile

stumpy.mstumped

Compute the multi-dimensional z-normalized matrix profile with a distributed dask/ray cluster

stumpy.subspace

Compute the k-dimensional matrix profile subspace for a given subsequence index and its nearest neighbor index

stumpy.mdl

Compute the multi-dimensional number of bits needed to compress one multi-dimensional subsequence with another along each of the k-dimensions using the minimum description length (MDL)

stumpy.atsc

Compute the anchored time series chain (ATSC)

stumpy.allc

Compute the all-chain set (ALLC)

stumpy.fluss

Compute the Fast Low-cost Unipotent Semantic Segmentation (FLUSS) for static data (i.e., batch processing)

stumpy.floss

Compute the Fast Low-cost Online Semantic Segmentation (FLOSS) for streaming data

stumpy.ostinato

Find the z-normalized consensus motif of multiple time series

stumpy.ostinatoed

Find the z-normalized consensus motif of multiple time series with a distributed cluster

stumpy.gpu_ostinato

Find the z-normalized consensus motif of multiple time series with one or more GPU devices

stumpy.mpdist

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series

stumpy.mpdisted

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with a distributed dask/ray cluster

stumpy.gpu_mpdist

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with one or more GPU devices

stumpy.motifs

Discover the top motifs for time series T

stumpy.match

Find all matches of a query Q in a time series T

stumpy.mmotifs

Discover the top motifs for the multi-dimensional time series T

stumpy.snippets

Identify the top k snippets that best represent the time series, T

stumpy.stimp

Compute the Pan Matrix Profile

stumpy.stimped

Compute the Pan Matrix Profile with a distributed dask/ray cluster

stumpy.gpu_stimp

Compute the Pan Matrix Profile with one or more GPU devices

stump#

stumpy.stump(T_A, m, T_B=None, ignore_trivial=True, normalize=True, p=2.0, k=1, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)[source]#

Compute the z-normalized matrix profile

This is a convenience wrapper around the Numba JIT-compiled parallelized _stump function which computes the (top-k) matrix profile according to STOMPopt with Pearson correlations.

Parameters:
  • T_A (numpy.ndarray) – The time series or sequence for which to compute the matrix profile

  • m (int) – Window size

  • T_B (numpy.ndarray, default None) – The time series or sequence that will be used to annotate T_A. For every subsequence in T_A, its nearest neighbor in T_B will be recorded. Default is None which corresponds to a self-join.

  • ignore_trivial (bool, default True) – Set to True if this is a self-join. Otherwise, for AB-join, set this to False. Default is True.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • k (int, default 1) – The number of top k smallest distances used to construct the matrix profile. Note that this will increase the total computational time and memory usage when k > 1. If you have access to a GPU device, then you may be able to leverage gpu_stump for better performance and scalability.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

Returns:

out – When k = 1 (default), the first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices. However, when k > 1, the output array will contain exactly 2 * k + 2 columns. The first k columns (i.e., out[:, :k]) consist of the top-k matrix profile, the next set of k columns (i.e., out[:, k : 2 * k]) consist of the corresponding top-k matrix profile indices, and the last two columns (i.e., out[:, 2 * k] and out[:, 2 * k + 1] or, equivalently, out[:, -2] and out[:, -1]) correspond to the top-1 left matrix profile indices and the top-1 right matrix profile indices, respectively.

Return type:

numpy.ndarray
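The column layout described for k > 1 can be unpacked with plain NumPy slicing. The sketch below uses a made-up placeholder array in place of a real result, so only the slicing pattern is meaningful; the variable names are illustrative.

```python
import numpy as np

k = 2
n_subseq = 5

# Placeholder standing in for the (n_subseq, 2 * k + 2) array returned when k > 1;
# the values here are arbitrary and NOT a real matrix profile
out = np.arange(n_subseq * (2 * k + 2), dtype=float).reshape(n_subseq, 2 * k + 2)

P_topk = out[:, :k]           # top-k matrix profile
I_topk = out[:, k : 2 * k]    # top-k matrix profile indices
left_I = out[:, 2 * k]        # top-1 left matrix profile indices
right_I = out[:, 2 * k + 1]   # top-1 right matrix profile indices
```

The last two columns can equivalently be read as out[:, -2] and out[:, -1].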

See also

stumpy.stumped

Compute the z-normalized matrix profile with a distributed dask/ray cluster

stumpy.gpu_stump

Compute the z-normalized matrix profile with one or more GPU devices

stumpy.scrump

Compute an approximate z-normalized matrix profile

Notes

DOI: 10.1007/s10115-017-1138-x

See Section 4.5

The above reference outlines a general approach for traversing the distance matrix in a diagonal fashion rather than in a row-wise fashion.

DOI: 10.1145/3357223.3362721

See Section 3.1 and Section 3.3

The above reference outlines the use of the Pearson correlation via Welford’s centered sum-of-products along each diagonal of the distance matrix in place of the sliding window dot product found in the original STOMP method.

DOI: 10.1109/ICDM.2016.0085

See Table II

The time series, T_A, will be annotated with the distance location (or index) of all its subsequences in another time series, T_B.

Return: For every subsequence, Q, in T_A, you will get a distance and index for the closest subsequence in T_B. Thus, the array returned will have length T_A.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.

Note: Unlike in Table II, where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, our algorithm reduces to the same algorithm found in Table II.

Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (See Definition 3 and Figure 3).

For self-joins, set ignore_trivial = True in order to avoid the trivial match.

Note that left and right matrix profiles are only available for self-joins.
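To make the notes above concrete, here is a brute-force sketch of a z-normalized self-join with an exclusion zone of ceil(m / 4). It is an O(n^2) illustration only, not the diagonal-traversal algorithm that the library uses; the function name naive_matrix_profile is hypothetical, and constant subsequences (zero standard deviation) are not handled.

```python
import numpy as np


def naive_matrix_profile(T, m):
    """Brute-force z-normalized self-join (illustrative, O(n^2) memory)."""
    T = np.asarray(T, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    # z-normalize every subsequence (assumes no constant subsequences)
    Z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(
        axis=1, keepdims=True
    )
    # Full pairwise Euclidean distance matrix between z-normalized subsequences
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    # Mask trivial matches that fall inside the exclusion zone
    excl_zone = int(np.ceil(m / 4))
    idx = np.arange(len(Z))
    D[np.abs(idx[:, None] - idx[None, :]) <= excl_zone] = np.inf
    I = D.argmin(axis=1)
    P = D[idx, I]
    return P, I


P, I = naive_matrix_profile([584.0, -11.0, 23.0, 79.0, 1001.0, 0.0, -19.0], m=3)
```

For this input, P and I should agree with the first two columns of the example output in this section.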

Examples

>>> import stumpy
>>> import numpy as np
>>> stumpy.stump(np.array([584., -11., 23., 79., 1001., 0., -19.]), m=3)
array([[0.11633857113691416, 4, -1, 4],
       [2.694073918063438, 3, -1, 3],
       [3.0000926340485923, 0, 0, 4],
       [2.694073918063438, 1, 1, -1],
       [0.11633857113691416, 0, 0, -1]], dtype=object)

stumped#

stumpy.stumped(client, T_A, m, T_B=None, ignore_trivial=True, normalize=True, p=2.0, k=1, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)[source]#

Compute the z-normalized matrix profile with a distributed dask/ray cluster

This is a highly distributed implementation around the Numba JIT-compiled parallelized _stump function which computes the (top-k) matrix profile according to STOMPopt with Pearson correlations.

Parameters:
  • client (client) – A Dask or Ray Distributed client. Setting up a distributed cluster is beyond the scope of this library. Please refer to the Dask or Ray Distributed documentation.

  • T_A (numpy.ndarray) – The time series or sequence for which to compute the matrix profile

  • m (int) – Window size

  • T_B (numpy.ndarray, default None) – The time series or sequence that will be used to annotate T_A. For every subsequence in T_A, its nearest neighbor in T_B will be recorded. Default is None which corresponds to a self-join.

  • ignore_trivial (bool, default True) – Set to True if this is a self-join. Otherwise, for AB-join, set this to False. Default is True.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • k (int, default 1) – The number of top k smallest distances used to construct the matrix profile. Note that this will increase the total computational time and memory usage when k > 1. If you have access to a GPU device, then you may be able to leverage gpu_stump for better performance and scalability.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
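As a quick illustration of the p parameter, the Minkowski distance between two raw (non-normalized) subsequences can be computed by hand. This is a sketch of the distance definition only; the helper name minkowski is hypothetical and is not part of the STUMPY API.

```python
import numpy as np


def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length arrays."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a = np.array([584.0, -11.0, 23.0])
b = np.array([1001.0, 0.0, -19.0])

d_manhattan = minkowski(a, b, p=1.0)  # |417| + |11| + |42|
d_euclidean = minkowski(a, b, p=2.0)  # same as np.linalg.norm(a - b)
```

As the parameter description states, p only plays a role when normalize is False.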

Returns:

out – When k = 1 (default), the first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices. However, when k > 1, the output array will contain exactly 2 * k + 2 columns. The first k columns (i.e., out[:, :k]) consist of the top-k matrix profile, the next set of k columns (i.e., out[:, k : 2 * k]) consist of the corresponding top-k matrix profile indices, and the last two columns (i.e., out[:, 2 * k] and out[:, 2 * k + 1] or, equivalently, out[:, -2] and out[:, -1]) correspond to the top-1 left matrix profile indices and the top-1 right matrix profile indices, respectively.

Return type:

numpy.ndarray

See also

stumpy.stump

Compute the z-normalized matrix profile

stumpy.gpu_stump

Compute the z-normalized matrix profile with one or more GPU devices

stumpy.scrump

Compute an approximate z-normalized matrix profile

Notes

DOI: 10.1007/s10115-017-1138-x

See Section 4.5

The above reference outlines a general approach for traversing the distance matrix in a diagonal fashion rather than in a row-wise fashion.

DOI: 10.1145/3357223.3362721

See Section 3.1 and Section 3.3

The above reference outlines the use of the Pearson correlation via Welford’s centered sum-of-products along each diagonal of the distance matrix in place of the sliding window dot product found in the original STOMP method.

DOI: 10.1109/ICDM.2016.0085

See Table II

This is a Dask distributed implementation of stump that scales across multiple servers and is a convenience wrapper around the parallelized stump._stump function.

The time series, T_A, will be annotated with the distance location (or index) of all its subsequences in another time series, T_B.

Return: For every subsequence, Q, in T_A, you will get a distance and index for the closest subsequence in T_B. Thus, the array returned will have length T_A.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.

Note: Unlike in Table II, where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, our algorithm reduces to the same algorithm found in Table II.

Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (See Definition 3 and Figure 3).

For self-joins, set ignore_trivial = True in order to avoid the trivial match.

Note that left and right matrix profiles are only available for self-joins.
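The left and right matrix profile indices mentioned above restrict the nearest-neighbor search to subsequences that start before or after the current one. A brute-force sketch, assuming a self-join with an exclusion zone of ceil(m / 4) and no constant subsequences, might look like this (the function name is hypothetical and the code is not the distributed implementation):

```python
import numpy as np


def naive_left_right_indices(T, m):
    """Brute-force left/right nearest-neighbor indices for a self-join."""
    T = np.asarray(T, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    Z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(
        axis=1, keepdims=True
    )
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    excl_zone = int(np.ceil(m / 4))
    n_subseq = len(Z)
    left_I = np.full(n_subseq, -1, dtype=np.int64)   # -1 means "no neighbor"
    right_I = np.full(n_subseq, -1, dtype=np.int64)
    for i in range(n_subseq):
        left_stop = i - excl_zone          # candidates are j < i - excl_zone
        if left_stop > 0:
            left_I[i] = np.argmin(D[i, :left_stop])
        right_start = i + excl_zone + 1    # candidates are j > i + excl_zone
        if right_start < n_subseq:
            right_I[i] = right_start + np.argmin(D[i, right_start:])
    return left_I, right_I


left_I, right_I = naive_left_right_indices(
    [584.0, -11.0, 23.0, 79.0, 1001.0, 0.0, -19.0], m=3
)
```

For this input, the two arrays should agree with the third and fourth columns of the example output in this section.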

Examples

>>> import stumpy
>>> import numpy as np
>>> from dask.distributed import Client
>>> if __name__ == "__main__":
...     with Client() as dask_client:
...         stumpy.stumped(
...             dask_client,
...             np.array([584., -11., 23., 79., 1001., 0., -19.]),
...             m=3)
array([[0.11633857113691416, 4, -1, 4],
       [2.694073918063438, 3, -1, 3],
       [3.0000926340485923, 0, 0, 4],
       [2.694073918063438, 1, 1, -1],
       [0.11633857113691416, 0, 0, -1]], dtype=object)

gpu_stump#

stumpy.gpu_stump(T_A, m, T_B=None, ignore_trivial=True, device_id=0, normalize=True, p=2.0, k=1, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)#

Compute the z-normalized matrix profile with one or more GPU devices

This is a convenience wrapper around the Numba cuda.jit _gpu_stump function which computes the matrix profile according to GPU-STOMP. The default number of threads-per-block is set to 512 and may be changed by setting the global parameter config.STUMPY_THREADS_PER_BLOCK to an appropriate number based on your GPU hardware.

Parameters:
  • T_A (numpy.ndarray) – The time series or sequence for which to compute the matrix profile

  • m (int) – Window size

  • T_B (numpy.ndarray, default None) – The time series or sequence that will be used to annotate T_A. For every subsequence in T_A, its nearest neighbor in T_B will be recorded. Default is None which corresponds to a self-join.

  • ignore_trivial (bool, default True) – Set to True if this is a self-join. Otherwise, for AB-join, set this to False. Default is True.

  • device_id (int or list, default 0) – The (GPU) device number to use. The default value is 0. A list of valid device ids (int) may also be provided for parallel GPU-STUMP computation. A list of all valid device ids can be obtained by executing [device.id for device in numba.cuda.list_devices()].

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • k (int, default 1) – The number of top k smallest distances used to construct the matrix profile. Note that this will increase the total computational time and memory usage when k > 1.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

Returns:

out – When k = 1 (default), the first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices. However, when k > 1, the output array will contain exactly 2 * k + 2 columns. The first k columns (i.e., out[:, :k]) consist of the top-k matrix profile, the next set of k columns (i.e., out[:, k : 2 * k]) consist of the corresponding top-k matrix profile indices, and the last two columns (i.e., out[:, 2 * k] and out[:, 2 * k + 1] or, equivalently, out[:, -2] and out[:, -1]) correspond to the top-1 left matrix profile indices and the top-1 right matrix profile indices, respectively.

Return type:

numpy.ndarray

See also

stumpy.stump

Compute the z-normalized matrix profile

stumpy.stumped

Compute the z-normalized matrix profile with a distributed dask/ray cluster

stumpy.scrump

Compute an approximate z-normalized matrix profile

Notes

DOI: 10.1109/ICDM.2016.0085

See Table II, Figure 5, and Figure 6

The time series, T_A, will be annotated with the distance location (or index) of all its subsequences in another time series, T_B.

Return: For every subsequence, Q, in T_A, you will get a distance and index for the closest subsequence in T_B. Thus, the array returned will have length T_A.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.

Note: Unlike in Table II, where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, our algorithm reduces to the same algorithm found in Table II.

Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (See Definition 3 and Figure 3).

For self-joins, set ignore_trivial = True in order to avoid the trivial match.

Note that left and right matrix profiles are only available for self-joins.
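The AB-join behavior described in these notes (T_A and T_B of different lengths, one result per subsequence of T_A) can be sketched on the CPU with a brute-force search. This is an illustration of the semantics only, not the GPU algorithm; the function name naive_ab_join and the T_B values are made up, and constant subsequences are not handled.

```python
import numpy as np


def naive_ab_join(T_A, T_B, m):
    """For each subsequence of T_A, find its nearest neighbor in T_B."""

    def znorm_windows(T):
        w = np.lib.stride_tricks.sliding_window_view(np.asarray(T, dtype=float), m)
        return (w - w.mean(axis=1, keepdims=True)) / w.std(axis=1, keepdims=True)

    A = znorm_windows(T_A)
    B = znorm_windows(T_B)
    # No exclusion zone is needed, since this is not a self-join
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    I = D.argmin(axis=1)
    P = D[np.arange(len(A)), I]
    return P, I


T_A = np.array([584.0, -11.0, 23.0, 79.0, 1001.0, 0.0, -19.0])
# T_B has a different length and contains a copy of T_A[0:3] at position 1
T_B = np.array([0.0, 584.0, -11.0, 23.0, 50.0, 10.0, 3.0, 7.0, 1.0])
P, I = naive_ab_join(T_A, T_B, m=3)
```

The result has T_A.shape[0] - m + 1 entries regardless of the length of T_B, and the first subsequence of T_A finds its exact copy in T_B at index 1.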

Examples

>>> import stumpy
>>> import numpy as np
>>> from numba import cuda
>>> if __name__ == "__main__":
...     all_gpu_devices = [device.id for device in cuda.list_devices()]
...     stumpy.gpu_stump(
...         np.array([584., -11., 23., 79., 1001., 0., -19.]),
...         m=3,
...         device_id=all_gpu_devices)
array([[0.11633857113691416, 4, -1, 4],
       [2.694073918063438, 3, -1, 3],
       [3.0000926340485923, 0, 0, 4],
       [2.694073918063438, 1, 1, -1],
       [0.11633857113691416, 0, 0, -1]], dtype=object)

mass#

stumpy.mass(Q, T, M_T=None, Σ_T=None, normalize=True, p=2.0, T_subseq_isfinite=None, T_subseq_isconstant=None, Q_subseq_isconstant=None, query_idx=None)[source]#

Compute the distance profile using the MASS algorithm

This is a convenience wrapper around the Numba JIT compiled _mass function.

Parameters:
  • Q (numpy.ndarray) – Query array or subsequence

  • T (numpy.ndarray) – Time series or sequence

  • M_T (numpy.ndarray, default None) – Sliding mean of T

  • Σ_T (numpy.ndarray, default None) – Sliding standard deviation of T

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. This parameter is ignored when normalize == True.

  • T_subseq_isfinite (numpy.ndarray, default None) – A boolean array that indicates whether a subsequence in T contains a np.nan/np.inf value (False). This parameter is ignored when normalize=True.

  • T_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • Q_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether the subsequence in Q is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether the subsequence in Q is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • query_idx (int, default None) – This is the index position along the time series, T, where the query subsequence, Q, is located. query_idx should be set to None if Q is not a subsequence of T. If Q is a subsequence of T, providing this argument is optional. If query_idx is provided, the distance between Q and T[query_idx : query_idx + m] will automatically be set to zero.

Returns:

distance_profile – Distance profile

Return type:

numpy.ndarray

See also

stumpy.motifs

Discover the top motifs for time series T

stumpy.match

Find all matches of a query Q in a time series T

Notes

DOI: 10.1109/ICDM.2016.0179

See Table II

Note that Q, T are not directly required to calculate D

Note: Unlike the Matrix Profile I paper, here M_T and Σ_T can be calculated once for all subsequences of T and passed in, which removes the redundant computation.
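The idea of computing the sliding mean and standard deviation once and reusing them can be sketched with a brute-force distance profile. This is not the FFT-based MASS algorithm itself; the function names are hypothetical, S_T stands in for Σ_T, and constant subsequences are not handled.

```python
import numpy as np


def sliding_mean_std(T, m):
    """Sliding mean and standard deviation of every length-m window of T."""
    windows = np.lib.stride_tricks.sliding_window_view(np.asarray(T, dtype=float), m)
    return windows.mean(axis=1), windows.std(axis=1)


def naive_distance_profile(Q, T, M_T=None, S_T=None):
    """z-normalized Euclidean distance between Q and every window of T."""
    Q = np.asarray(Q, dtype=float)
    T = np.asarray(T, dtype=float)
    m = len(Q)
    if M_T is None or S_T is None:
        M_T, S_T = sliding_mean_std(T, m)
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    Q_norm = (Q - Q.mean()) / Q.std()
    T_norm = (windows - M_T[:, None]) / S_T[:, None]
    return np.linalg.norm(T_norm - Q_norm, axis=1)


T = np.array([584.0, -11.0, 23.0, 79.0, 1001.0, 0.0, -19.0])
Q = np.array([-11.1, 23.4, 79.5, 1001.0])

# Compute the sliding statistics once, then reuse them for any number of queries
M_T, S_T = sliding_mean_std(T, len(Q))
D = naive_distance_profile(Q, T, M_T, S_T)
```

For this input, D should match the distance profile in the example below to within floating-point tolerance, with the smallest distance at index 1.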

Examples

>>> import stumpy
>>> import numpy as np
>>> stumpy.mass(
...     np.array([-11.1, 23.4, 79.5, 1001.0]),
...     np.array([584., -11., 23., 79., 1001., 0., -19.]))
array([3.18792463e+00, 1.11297393e-03, 3.23874018e+00, 3.34470195e+00])

scrump#

stumpy.scrump(T_A, m, T_B=None, ignore_trivial=True, percentage=0.01, pre_scrump=False, s=None, normalize=True, p=2.0, k=1, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)[source]#

Compute an approximate z-normalized matrix profile

This is a convenience wrapper around the Numba JIT-compiled parallelized _stump function which computes the matrix profile according to SCRIMP.

Parameters:
  • T_A (numpy.ndarray) – The time series or sequence for which to compute the matrix profile

  • T_B (numpy.ndarray) – The time series or sequence that will be used to annotate T_A. For every subsequence in T_A, its nearest neighbor in T_B will be recorded.

  • m (int) – Window size

  • ignore_trivial (bool) – Set to True if this is a self-join. Otherwise, for AB-join, set this to False. Default is True.

  • percentage (float) – Approximate percentage completed. The value is between 0.0 and 1.0.

  • pre_scrump (bool) – A flag for whether or not to perform the PreSCRIMP calculation prior to computing SCRIMP. If set to True, this is equivalent to computing SCRIMP++ and may lead to faster convergence

  • s (int) – The size of the PreSCRIMP fixed interval. If pre_scrump=True and s=None, then s will automatically be set to s=int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM)), the size of the exclusion zone.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this class gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized class decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • k (int, default 1) – The number of top k smallest distances used to construct the matrix profile. Note that this will increase the total computational time and memory usage when k > 1.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

stumpy.P_#

The updated (top-k) matrix profile. When k=1 (default), this output is a 1D array consisting of the matrix profile. When k > 1, the output is a 2D array that has exactly k columns consisting of the top-k matrix profile.

Type:

numpy.ndarray

stumpy.I_#

The updated (top-k) matrix profile indices. When k=1 (default), this output is a 1D array consisting of the matrix profile indices. When k > 1, the output is a 2D array that has exactly k columns consisting of the top-k matrix profile indices.

Type:

numpy.ndarray

stumpy.left_I_#

The updated left (top-1) matrix profile indices

Type:

numpy.ndarray

stumpy.right_I_#

The updated right (top-1) matrix profile indices

Type:

numpy.ndarray

stumpy.update()#

Update the matrix profile and the matrix profile indices by computing additional new distances (limited by percentage) that make up the full distance matrix. It updates the (top-k) matrix profile, (top-1) left matrix profile, (top-1) right matrix profile, (top-k) matrix profile indices, (top-1) left matrix profile indices, and (top-1) right matrix profile indices.
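The idea of approximating the matrix profile by evaluating only part of the distance matrix can be sketched by visiting a random subset of its diagonals. This is a simplified illustration under stated assumptions, not the library's implementation: the function name is hypothetical, percentage is interpreted here as a fraction of diagonals, and constant subsequences are not handled.

```python
import numpy as np


def approx_matrix_profile(T, m, percentage=0.01, seed=0):
    """Approximate self-join that evaluates a random fraction of diagonals."""
    T = np.asarray(T, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(T, m)
    Z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(
        axis=1, keepdims=True
    )
    n_subseq = len(Z)
    excl_zone = int(np.ceil(m / 4))
    P = np.full(n_subseq, np.inf)
    I = np.full(n_subseq, -1, dtype=np.int64)

    # Diagonals outside the exclusion zone, visited in random order
    diags = np.arange(excl_zone + 1, n_subseq)
    np.random.default_rng(seed).shuffle(diags)
    n_diags = int(np.ceil(len(diags) * percentage))

    for g in diags[:n_diags]:
        i = np.arange(n_subseq - g)
        j = i + g
        d = np.linalg.norm(Z[i] - Z[j], axis=1)
        # Each distance d(i, j) may improve the profile at both i and j
        better = d < P[i]
        P[i[better]], I[i[better]] = d[better], j[better]
        better = d < P[j]
        P[j[better]], I[j[better]] = d[better], i[better]
    return P, I


T = np.array([584.0, -11.0, 23.0, 79.0, 1001.0, 0.0, -19.0])
P_part, I_part = approx_matrix_profile(T, m=3, percentage=0.5)
P_full, I_full = approx_matrix_profile(T, m=3, percentage=1.0)
```

In this sketch, visiting every diagonal (percentage=1.0) recovers the exact matrix profile, while a smaller fraction yields values that are never below the exact ones and may remain inf where no distance has been evaluated yet.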

See also

stumpy.stump

Compute the z-normalized matrix profile

stumpy.stumped

Compute the z-normalized matrix profile with a distributed dask/ray cluster

stumpy.gpu_stump

Compute the z-normalized matrix profile with one or more GPU devices

Notes

DOI: 10.1109/ICDM.2018.00099

See Algorithm 1 and Algorithm 2

Examples

>>> import stumpy
>>> import numpy as np
>>> approx_mp = stumpy.scrump(
...     np.array([584., -11., 23., 79., 1001., 0., -19.]),
...     m=3)
>>> approx_mp.update()
>>> approx_mp.P_
array([2.982409  , 3.28412702,        inf, 2.982409  , 3.28412702])
>>> approx_mp.I_
array([ 3,  4, -1,  0,  1])

stumpi#

stumpy.stumpi(T, m, egress=True, normalize=True, p=2.0, k=1, mp=None, T_subseq_isconstant_func=None)[source]#

Compute an incremental z-normalized matrix profile for streaming data

This is based on the on-line STOMPI and STAMPI algorithms.

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which the matrix profile and matrix profile indices will be returned

  • m (int) – Window size

  • egress (bool, default True) – If set to True, the oldest data point in the time series is removed and the time series length remains constant rather than forever increasing

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this class gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized class decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. This parameter is ignored when normalize == True.

  • k (int, default 1) – The number of top k smallest distances used to construct the matrix profile. Note that this will increase the total computational time and memory usage when k > 1.

  • mp (numpy.ndarray, default None) – A pre-computed matrix profile (and corresponding matrix profile indices). This is a 2D array of shape (len(T) - m + 1, 2 * k + 2), where the first k columns are the top-k matrix profile, and the next k columns are their corresponding indices. The last two columns correspond to the top-1 left and top-1 right matrix profile indices. When None (default), this array is computed internally using stumpy.stump.

  • T_subseq_isconstant_func (function, default None) – A custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

stumpy.P_#

The updated (top-k) matrix profile for T. When k=1 (default), the first (and only) column in this 2D array consists of the matrix profile. When k > 1, the output has exactly k columns consisting of the top-k matrix profile.

Type:

numpy.ndarray

stumpy.I_#

The updated (top-k) matrix profile indices for T. When k=1 (default), the first (and only) column in this 2D array consists of the matrix profile indices. When k > 1, the output has exactly k columns consisting of the top-k matrix profile indices.

Type:

numpy.ndarray

stumpy.left_P_#

The updated left (top-1) matrix profile for T

Type:

numpy.ndarray

stumpy.left_I_#

The updated left (top-1) matrix profile indices for T

Type:

numpy.ndarray

stumpy.T_#

The updated time series or sequence for which the matrix profile and matrix profile indices are computed

Type:

numpy.ndarray

stumpy.update(t)#

Append a single new data point, t, to the time series, T, and update the matrix profile

Notes

DOI: 10.1007/s10618-017-0519-9

See Table V

Note that line 11 is missing an important sqrt operation!

Examples

>>> import stumpy
>>> import numpy as np
>>> stream = stumpy.stumpi(
...     np.array([584., -11., 23., 79., 1001., 0.]),
...     m=3)
>>> stream.update(-19.0)
>>> stream.left_P_
array([       inf, 3.00009263, 2.69407392, 3.05656417])
>>> stream.left_I_
array([-1,  0,  1,  2])

mstump#

stumpy.mstump(T, m, include=None, discords=False, normalize=True, p=2.0, T_subseq_isconstant=None)[source]#

Compute the multi-dimensional z-normalized matrix profile

This is a convenience wrapper around the Numba JIT-compiled parallelized _mstump function which computes the multi-dimensional matrix profile and multi-dimensional matrix profile index according to mSTOMP, a variant of mSTAMP. Note that only self-joins are supported.

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which to compute the multi-dimensional matrix profile. Each row in T holds the data for one dimension, while each column in T holds the values of all dimensions at a single time point.

  • m (int) – Window size

  • include (list, numpy.ndarray, default None) –

    A list of (zero-based) indices corresponding to the dimensions in T that must be included in the constrained multidimensional motif search. For more information, see Section IV D in:

    DOI: 10.1109/ICDM.2017.66

  • discords (bool, default False) – When set to True, this reverses the distance matrix which results in a multi-dimensional matrix profile that favors larger matrix profile values (i.e., discords) rather than smaller values (i.e., motifs). Note that indices in include are still maintained and respected.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant (numpy.ndarray, function, or list, default None) – A parameter that is used to show whether a subsequence of a time series in T is constant (True) or not. T_subseq_isconstant can be a 2D boolean numpy.ndarray or a function that can be applied to each time series in T. Alternatively, for maximum flexibility, a list (with length equal to the total number of time series) may also be used. In this case, T_subseq_isconstant[i] corresponds to the i-th time series T[i] and each element in the list can either be a 1D boolean np.ndarray, a function, or None.

Returns:

  • P (numpy.ndarray) – The multi-dimensional matrix profile. Each row of the array corresponds to each matrix profile for a given dimension (i.e., the first row is the 1-D matrix profile and the second row is the 2-D matrix profile).

  • I (numpy.ndarray) – The multi-dimensional matrix profile index where each row of the array corresponds to each matrix profile index for a given dimension.
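To make the row semantics of P and I concrete, here is a brute-force sketch of the multi-dimensional matrix profile: for every pair of subsequences it sorts the per-dimension z-normalized distances and takes the running mean over the best 1, 2, ..., D dimensions. This is not the mSTOMP implementation; the function names, the O(n²) loops, and the exclusion zone of ceil(m / 4) are assumptions made for this illustration, and it ignores include, discords, and constant subsequences.

```python
import numpy as np


def znorm_dist(a, b):
    """z-normalized Euclidean distance between two equal-length windows."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return np.linalg.norm(a - b)


def naive_mstump(T, m):
    """Hypothetical brute-force reference, not STUMPY's algorithm."""
    d, n = T.shape
    n_subseq = n - m + 1
    excl_zone = int(np.ceil(m / 4))  # assumed trivial-match exclusion zone
    P = np.full((d, n_subseq), np.inf)
    I = np.full((d, n_subseq), -1, dtype=np.int64)
    for i in range(n_subseq):
        for j in range(n_subseq):
            if abs(i - j) <= excl_zone:
                continue  # skip trivial matches
            # Distance in each dimension, best dimensions first
            dists = np.sort(
                [znorm_dist(T[r, i : i + m], T[r, j : j + m]) for r in range(d)]
            )
            # Row k holds the mean over the k + 1 best dimensions
            means = np.cumsum(dists) / np.arange(1, d + 1)
            better = means < P[:, i]
            P[better, i] = means[better]
            I[better, i] = j
    return P, I


T = np.array([[584., -11., 23., 79., 1001., 0., -19.],
              [  1.,   2.,  4.,  8.,   16., 0.,  32.]])
P, I = naive_mstump(T, m=3)
print(P.shape, I.shape)  # (2, 5) (2, 5)
```

Row 0 of P is the best single-dimension distance for each subsequence, and row 1 averages both dimensions.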

See also

stumpy.mstumped

Compute the multi-dimensional z-normalized matrix profile with a distributed dask/ray cluster

stumpy.subspace

Compute the k-dimensional matrix profile subspace for a given subsequence index and its nearest neighbor index

stumpy.mdl

Compute the number of bits needed to compress one array with another using the minimum description length (MDL)

Notes

DOI: 10.1109/ICDM.2017.66

See mSTAMP Algorithm

Examples

>>> stumpy.mstump(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     m=3)
(array([[0.        , 1.43947142, 0.        , 2.69407392, 0.11633857],
        [0.777905  , 2.36179922, 1.50004632, 2.92246722, 0.777905  ]]),
 array([[2, 4, 0, 1, 0],
        [4, 4, 0, 1, 0]]))

mstumped#

stumpy.mstumped(client, T, m, include=None, discords=False, p=2.0, normalize=True, T_subseq_isconstant=None)[source]#

Compute the multi-dimensional z-normalized matrix profile with a distributed dask/ray cluster

This is a highly distributed implementation around the Numba JIT-compiled parallelized _mstump function which computes the multi-dimensional matrix profile according to mSTOMP, a variant of mSTAMP. Note that only self-joins are supported.

Parameters:
  • client (client) – A Dask or Ray Distributed client. Setting up a distributed cluster is beyond the scope of this library. Please refer to the Dask or Ray Distributed documentation.

  • T (numpy.ndarray) – The time series or sequence for which to compute the multi-dimensional matrix profile. Each row in T holds the data for one dimension, while each column in T holds the values of all dimensions at a single time point.

  • m (int) – Window size

  • include (list, numpy.ndarray, default None) –

    A list of (zero-based) indices corresponding to the dimensions in T that must be included in the constrained multidimensional motif search. For more information, see Section IV D in:

    DOI: 10.1109/ICDM.2017.66

  • discords (bool, default False) – When set to True, this reverses the distance matrix which results in a multi-dimensional matrix profile that favors larger matrix profile values (i.e., discords) rather than smaller values (i.e., motifs). Note that indices in include are still maintained and respected.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • T_subseq_isconstant (numpy.ndarray, function, or list, default None) – A parameter that is used to show whether a subsequence of a time series in T is constant (True) or not. T_subseq_isconstant can be a 2D boolean numpy.ndarray or a function that can be applied to each time series in T. Alternatively, for maximum flexibility, a list (with length equal to the total number of time series) may also be used. In this case, T_subseq_isconstant[i] corresponds to the i-th time series T[i] and each element in the list can either be a 1D boolean np.ndarray, a function, or None.

Returns:

  • P (numpy.ndarray) – The multi-dimensional matrix profile. Each row of the array corresponds to each matrix profile for a given dimension (i.e., the first row is the 1-D matrix profile and the second row is the 2-D matrix profile).

  • I (numpy.ndarray) – The multi-dimensional matrix profile index where each row of the array corresponds to each matrix profile index for a given dimension.

See also

stumpy.mstump

Compute the multi-dimensional z-normalized matrix profile

stumpy.subspace

Compute the k-dimensional matrix profile subspace for a given subsequence index and its nearest neighbor index

stumpy.mdl

Compute the number of bits needed to compress one array with another using the minimum description length (MDL)

Notes

DOI: 10.1109/ICDM.2017.66

See mSTAMP Algorithm

Examples

>>> import stumpy
>>> import numpy as np
>>> from dask.distributed import Client
>>> if __name__ == "__main__":
...     with Client() as dask_client:
...         stumpy.mstumped(
...             dask_client,
...             np.array([[584., -11., 23., 79., 1001., 0., -19.],
...                       [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...             m=3)
(array([[0.        , 1.43947142, 0.        , 2.69407392, 0.11633857],
        [0.777905  , 2.36179922, 1.50004632, 2.92246722, 0.777905  ]]),
 array([[2, 4, 0, 1, 0],
        [4, 4, 0, 1, 0]]))

subspace#

stumpy.subspace(T, m, subseq_idx, nn_idx, k, include=None, discords=False, discretize_func=None, n_bit=8, normalize=True, p=2.0, T_subseq_isconstant=None)[source]#

Compute the k-dimensional matrix profile subspace for a given subsequence index and its nearest neighbor index

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which the multi-dimensional matrix profile and multi-dimensional matrix profile indices were computed

  • m (int) – Window size

  • subseq_idx (int) – The subsequence index in T

  • nn_idx (int) – The nearest neighbor index in T

  • k (int) – The subset number of dimensions out of D = T.shape[0]-dimensions to return the subspace for. Note that zero-based indexing is used.

  • include (numpy.ndarray, default None) –

    A list of (zero-based) indices corresponding to the dimensions in T that must be included in the constrained multidimensional motif search. For more information, see Section IV D in:

    DOI: 10.1109/ICDM.2017.66

  • discords (bool, default False) – When set to True, this reverses the distance profile to favor discords rather than motifs. Note that indices in include are still maintained and respected.

  • discretize_func (func, default None) – A function for discretizing each input array. When this is None, an appropriate discretization function (based on the normalize parameter) will be applied.

  • n_bit (int, default 8) –

    The number of bits used for discretization. For more information on an appropriate value, see Figure 4 in:

    DOI: 10.1109/ICDM.2016.0069

    and Figure 2 in:

    DOI: 10.1109/ICDM.2011.54

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant (numpy.ndarray, function, or list, default None) – A parameter that is used to show whether a subsequence of a time series in T is constant (True) or not. T_subseq_isconstant can be a 2D boolean numpy.ndarray or a function that can be applied to each time series in T. Alternatively, for maximum flexibility, a list (with length equal to the total number of time series) may also be used. In this case, T_subseq_isconstant[i] corresponds to the i-th time series T[i] and each element in the list can either be a 1D boolean np.ndarray, a function, or None.

Returns:

S – An array that contains the (singular) k-th dimensional subspace for the subsequence with index equal to subseq_idx. Note that k+1 rows will be returned.

Return type:

numpy.ndarray

See also

stumpy.mstump

Compute the multi-dimensional z-normalized matrix profile

stumpy.mstumped

Compute the multi-dimensional z-normalized matrix profile with a distributed dask/ray cluster

stumpy.mdl

Compute the number of bits needed to compress one array with another using the minimum description length (MDL)

Examples

>>> import stumpy
>>> import numpy as np
>>> mps, indices = stumpy.mstump(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     m=3)
>>> motifs_idx = np.argsort(mps, axis=1)[:, :2]
>>> k = 1
>>> stumpy.subspace(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     m=3,
...     subseq_idx=motifs_idx[k][0],
...     nn_idx=indices[k][motifs_idx[k][0]],
...     k=k)
array([0, 1])

mdl#

stumpy.mdl(T, m, subseq_idx, nn_idx, include=None, discords=False, discretize_func=None, n_bit=8, normalize=True, p=2.0, T_subseq_isconstant=None)[source]#

Compute the multi-dimensional number of bits needed to compress one multi-dimensional subsequence with another along each of the k-dimensions using the minimum description length (MDL)

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which the multi-dimensional matrix profile and multi-dimensional matrix profile indices were computed

  • m (int) – Window size

  • subseq_idx (numpy.ndarray) – The multi-dimensional subsequence indices in T

  • nn_idx (numpy.ndarray) – The multi-dimensional nearest neighbor index in T

  • include (numpy.ndarray, default None) –

    A list of (zero-based) indices corresponding to the dimensions in T that must be included in the constrained multidimensional motif search. For more information, see Section IV D in:

    DOI: 10.1109/ICDM.2017.66

  • discords (bool, default False) – When set to True, this reverses the distance profile to favor discords rather than motifs. Note that indices in include are still maintained and respected.

  • discretize_func (func, default None) – A function for discretizing each input array. When this is None, an appropriate discretization function (based on the normalize parameter) will be applied.

  • n_bit (int, default 8) –

    The number of bits used for discretization and for computing the bit size. For more information on an appropriate value, see Figure 4 in:

    DOI: 10.1109/ICDM.2016.0069

    and Figure 2 in:

    DOI: 10.1109/ICDM.2011.54

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant (numpy.ndarray, function, or list, default None) – A parameter that is used to show whether a subsequence of a time series in T is constant (True) or not. T_subseq_isconstant can be a 2D boolean numpy.ndarray or a function that can be applied to each time series in T. Alternatively, for maximum flexibility, a list (with length equal to the total number of time series) may also be used. In this case, T_subseq_isconstant[i] corresponds to the i-th time series T[i] and each element in the list can either be a 1D boolean np.ndarray, a function, or None.

Returns:

  • bit_sizes (numpy.ndarray) – The total number of bits computed from MDL for representing each pair of multidimensional subsequences.

  • S (list) – A list of numpy.ndarrays that contain the k-th dimensional subspaces

See also

stumpy.mstump

Compute the multi-dimensional z-normalized matrix profile

stumpy.mstumped

Compute the multi-dimensional z-normalized matrix profile with a distributed dask/ray cluster

stumpy.subspace

Compute the k-dimensional matrix profile subspace for a given subsequence index and its nearest neighbor index

Examples

>>> import stumpy
>>> import numpy as np
>>> mps, indices = stumpy.mstump(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     m=3)
>>> motifs_idx = np.argsort(mps, axis=1)[:, 0]
>>> stumpy.mdl(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     m=3,
...     subseq_idx=motifs_idx,
...     nn_idx=indices[np.arange(motifs_idx.shape[0]), motifs_idx])
(array([ 80.      , 111.509775]), [array([1]), array([0, 1])])

atsc#

stumpy.atsc(IL, IR, j)[source]#

Compute the anchored time series chain (ATSC)

Note that since the matrix profile indices, IL and IR, are pre-computed, this function is agnostic to subsequence normalization.

Parameters:
  • IL (numpy.ndarray) – Left matrix profile indices

  • IR (numpy.ndarray) – Right matrix profile indices

  • j (int) – The index value for which to compute the ATSC

Returns:

out – Anchored time series chain for index, j

Return type:

numpy.ndarray

See also

stumpy.allc

Compute the all-chain set (ALLC)

Notes

DOI: 10.1109/ICDM.2017.79

See Table I

This is the implementation for the anchored time series chains (ATSC).

Unlike the original paper, we’ve replaced the while-loop with a more stable for-loop.
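The chain-following logic with a bounded for-loop can be sketched as follows. The function name atsc_sketch and the hand-made IL and IR arrays are hypothetical and chosen only for illustration; -1 is used here to mark "no left/right neighbor".

```python
import numpy as np


def atsc_sketch(IL, IR, j):
    """Follow right neighbors from j while each link is mutual."""
    chain = [j]
    # A chain can never be longer than the number of subsequences,
    # so a bounded for-loop replaces the paper's while-loop
    for _ in range(len(IL)):
        nxt = IR[j]
        if nxt == -1 or IL[nxt] != j:
            break  # no right neighbor, or it does not point back to j
        j = nxt
        chain.append(j)
    return np.asarray(chain)


# Hand-made left/right matrix profile indices (illustrative values only)
IL = np.array([-1, -1, 0, 1, 0])
IR = np.array([4, 3, 4, -1, -1])

print(atsc_sketch(IL, IR, 1))  # [1 3]
print(atsc_sketch(IL, IR, 2))  # [2]
```

Index 2 yields a chain of length one because its right neighbor, 4, has 0 rather than 2 as its left neighbor.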

Examples

>>> import stumpy
>>> import numpy as np
>>> mp = stumpy.stump(np.array([584., -11., 23., 79., 1001., 0., -19.]), m=3)
>>> stumpy.atsc(mp[:, 2], mp[:, 3], 1)
array([1, 3])

allc#

stumpy.allc(IL, IR)[source]#

Compute the all-chain set (ALLC)

Note that since the matrix profile indices, IL and IR, are pre-computed, this function is agnostic to subsequence normalization.

Parameters:
  • IL (numpy.ndarray) – Left matrix profile indices

  • IR (numpy.ndarray) – Right matrix profile indices

Returns:

  • S (list(numpy.ndarray)) – All-chain set

  • C (numpy.ndarray) – Anchored time series chain for the longest chain (also known as the unanchored chain). Note that when there are multiple different chains with length equal to len(C), then only one chain from this set is returned. You may iterate over the all-chain set, S, to find all other possible chains with length len(C).

See also

stumpy.atsc

Compute the anchored time series chain (ATSC)

Notes

DOI: 10.1109/ICDM.2017.79

See Table II

Unlike the original paper, we’ve replaced the while-loop with a more stable for-loop.

This is the implementation for the all-chain set (ALLC) and the unanchored chain is simply the longest one among the all-chain set. Both the all-chain set and unanchored chain are returned.

The all-chain set, S, is returned as a list of unique numpy arrays.
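A minimal sketch of how the all-chain set and the unanchored chain relate is given below: every index that has not already been absorbed into an earlier chain starts a new chain, and the longest chain is reported separately. The function names and the hand-made IL and IR arrays are hypothetical, and tie-breaking between equally long chains (here, the first one found) is an assumption of this sketch.

```python
import numpy as np


def follow_chain(IL, IR, j):
    """Follow right neighbors from j while each link is mutual."""
    chain = [j]
    for _ in range(len(IL)):
        nxt = IR[j]
        if nxt == -1 or IL[nxt] != j:
            break
        j = nxt
        chain.append(j)
    return chain


def allc_sketch(IL, IR):
    visited = np.zeros(len(IL), dtype=bool)
    chains = []
    for i in range(len(IL)):
        if visited[i]:
            continue  # already part of a chain that started earlier
        chain = follow_chain(IL, IR, i)
        visited[chain] = True
        chains.append(np.asarray(chain))
    # The unanchored chain is the longest member of the all-chain set
    longest = max(chains, key=len)
    return chains, longest


# Hand-made left/right matrix profile indices (illustrative values only)
IL = np.array([-1, -1, 0, 1, 0])
IR = np.array([4, 3, 4, -1, -1])

S, C = allc_sketch(IL, IR)
print([s.tolist() for s in S], C.tolist())  # [[0, 4], [1, 3], [2]] [0, 4]
```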

Examples

>>> import stumpy
>>> import numpy as np
>>> mp = stumpy.stump(np.array([584., -11., 23., 79., 1001., 0., -19.]), m=3)
>>> stumpy.allc(mp[:, 2], mp[:, 3])
([array([1, 3]), array([2]), array([0, 4])], array([0, 4]))

fluss#

stumpy.fluss(I, L, n_regimes, excl_factor=5, custom_iac=None)[source]#

Compute the Fast Low-cost Unipotent Semantic Segmentation (FLUSS) for static data (i.e., batch processing)

Essentially, this is a wrapper to compute the corrected arc curve and regime locations. Note that since the matrix profile indices, I, are pre-computed, this function is agnostic to subsequence normalization.

Parameters:
  • I (numpy.ndarray) – The matrix profile indices for the time series of interest

  • L (int) – The subsequence length that is set roughly to be one period length. This is likely to be the same value as the window size, m, used to compute the matrix profile and matrix profile index but it can be different since this is only used to manage edge effects and has no bearing on any of the IAC or CAC core calculations.

  • n_regimes (int) – The number of regimes to search for. This is one more than the number of regime changes as denoted in the original paper.

  • excl_factor (int, default 5) – The multiplying factor for the regime exclusion zone

  • custom_iac (numpy.ndarray, default None) – A custom idealized arc curve (IAC) that will be used for correcting the arc curve

Returns:

  • cac (numpy.ndarray) – A corrected arc curve (CAC)

  • regime_locs (numpy.ndarray) – The locations of the regimes

See also

stumpy.floss

Compute the Fast Low-cost Online Semantic Segmentation (FLOSS) for streaming data

Notes

DOI: 10.1109/ICDM.2017.21

See Section A

This is the implementation for Fast Low-cost Unipotent Semantic Segmentation (FLUSS).
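The idea behind the corrected arc curve can be sketched in a few lines: draw an arc from every subsequence to its nearest neighbor, count how many arcs pass over each position, and divide by an idealized arc curve. The function names, the simple parabolic IAC, and the omission of the L and excl_factor edge handling are all assumptions made for this illustration, so the numbers will not match stumpy.fluss.

```python
import numpy as np


def arc_curve(I):
    """Count, for every position, how many nearest-neighbor arcs cross it."""
    I = np.asarray(I, dtype=np.int64)
    marks = np.zeros(I.size, dtype=np.int64)
    for i, j in enumerate(I):
        marks[min(i, j)] += 1  # an arc starts at its left end ...
        marks[max(i, j)] -= 1  # ... and stops at its right end
    return marks.cumsum()


def corrected_arc_curve(I):
    n = len(I)
    ac = arc_curve(I)
    idx = np.arange(n)
    iac = 2.0 * idx * (n - idx) / n  # assumed parabolic idealized arc curve
    cac = np.ones(n)
    np.divide(ac, iac, out=cac, where=iac > 0)  # leave 1.0 where IAC is zero
    return np.minimum(cac, 1.0)


# Illustrative indices: positions 0-3 only point at each other, as do 4-5
I = np.array([3, 2, 1, 0, 5, 4])
ac = arc_curve(I)
cac = corrected_arc_curve(I)
print(ac)  # [2 4 2 0 2 0]
```

No arc crosses position 3, so the corrected arc curve drops to zero there, which is where a regime change would be suggested.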

Examples

>>> import stumpy
>>> import numpy as np
>>> mp = stumpy.stump(np.array([584., -11., 23., 79., 1001., 0., -19.]), m=3)
>>> stumpy.fluss(mp[:, 0], 3, 2)
(array([1., 1., 1., 1., 1.]), array([0]))

floss#

stumpy.floss(mp, T, m, L, excl_factor=5, n_iter=1000, n_samples=1000, custom_iac=None, normalize=True, p=2.0, T_subseq_isconstant_func=None)[source]#

Compute the Fast Low-cost Online Semantic Segmentation (FLOSS) for streaming data

Parameters:
  • mp (numpy.ndarray) – The first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.

  • T (numpy.ndarray) – The 1-D time series data used to generate the matrix profile and matrix profile indices found in mp. Note that the right matrix profile index is used and the right matrix profile is intelligently recomputed on the fly from T instead of using the bidirectional matrix profile.

  • m (int) – The window size for computing sliding window mass. This is identical to the window size used in the matrix profile calculation. For managing edge effects, see the L parameter.

  • L (int) – The subsequence length that is set roughly to be one period length. This is likely to be the same value as the window size, m, used to compute the matrix profile and matrix profile index but it can be different since this is only used to manage edge effects and has no bearing on any of the IAC or CAC core calculations.

  • excl_factor (int, default 5) – The multiplying factor for the regime exclusion zone. Note that this is unrelated to the excl_zone used to compute the matrix profile.

  • n_iter (int, default 1000) – Number of iterations to average over when determining the parameters for the IAC beta distribution

  • n_samples (int, default 1000) – Number of distribution samples to draw during each iteration when computing the IAC

  • custom_iac (numpy.ndarray, default None) – A custom idealized arc curve (IAC) that will be used for correcting the arc curve

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant_func (function, default None) – A custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

stumpy.cac_1d_#

A 1-dimensional corrected arc curve (CAC) updated as a result of ingressing a single new data point and egressing a single old data point.

Type:

numpy.ndarray

stumpy.P_#

The matrix profile updated as a result of ingressing a single new data point and egressing a single old data point.

Type:

numpy.ndarray

stumpy.I_#

The (right) matrix profile indices updated as a result of ingressing a single new data point and egressing a single old data point.

Type:

numpy.ndarray

stumpy.T_#

The updated time series, T

Type:

numpy.ndarray

stumpy.update(t)#

Ingress a new data point, t, onto the time series, T, followed by egressing the oldest single data point from T. Then, update the 1-dimensional corrected arc curve (CAC_1D) and the matrix profile.
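The ingress/egress step on T can be pictured as a fixed-length sliding window. The helper below is a hypothetical NumPy-only sketch of that one step; it does not update the CAC or the matrix profile.

```python
import numpy as np


def ingress_egress(T, t):
    """Drop the oldest point of T and append the new point t."""
    T = np.asarray(T, dtype=np.float64)
    out = np.empty_like(T)
    out[:-1] = T[1:]  # egress: shift everything one step to the left
    out[-1] = t       # ingress: place the new point at the end
    return out


T = np.array([584., -11., 23., 79., 1001., 0.])
T = ingress_egress(T, 19.)
print(T)  # [ -11.   23.   79. 1001.    0.   19.]
```

Because one point leaves for every point that arrives, the length of T stays constant.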

See also

stumpy.fluss

Compute the Fast Low-cost Unipotent Semantic Segmentation (FLUSS) for static data (i.e., batch processing)

Notes

DOI: 10.1109/ICDM.2017.21

See Section C

This is the implementation for Fast Low-cost Online Semantic Segmentation (FLOSS).

Examples

>>> import stumpy
>>> import numpy as np
>>> mp = stumpy.stump(np.array([584., -11., 23., 79., 1001., 0.]), m=3)
>>> stream = stumpy.floss(
...     mp,
...     np.array([584., -11., 23., 79., 1001., 0.]),
...     m=3,
...     L=3)
>>> stream.update(19.)
>>> stream.cac_1d_
array([1., 1., 1., 1.])

ostinato#

stumpy.ostinato(Ts, m, normalize=True, p=2.0, Ts_subseq_isconstant=None)[source]#

Find the z-normalized consensus motif of multiple time series

This is a wrapper around the vanilla version of the ostinato algorithm which finds the best radius and a helper function that finds the most central conserved motif.

Parameters:
  • Ts (list) – A list of time series for which to find the most central consensus motif

  • m (int) – Window size

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • Ts_subseq_isconstant (list, default None) – A list containing, for each time series in Ts, its rolling window isconstant (i.e., which of its subsequences are constant).
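For the p parameter above, the Minkowski distance that applies when subsequences are not z-normalized can be sketched directly. The function name minkowski is a hypothetical helper for illustration and is not a STUMPY function.

```python
import numpy as np


def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)


a = [0.0, 0.0]
b = [3.0, 4.0]
print(minkowski(a, b, p=1.0))  # Manhattan distance: 7.0
print(minkowski(a, b, p=2.0))  # Euclidean distance: 5.0
```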

Returns:

  • central_radius (float) – Radius of the most central consensus motif

  • central_Ts_idx (int) – The time series index in Ts which contains the most central consensus motif

  • central_subseq_idx (int) – The subsequence index within time series Ts[central_Ts_idx] that contains the most central consensus motif

See also

stumpy.ostinatoed

Find the z-normalized consensus motif of multiple time series with a distributed cluster

stumpy.gpu_ostinato

Find the z-normalized consensus motif of multiple time series with one or more GPU devices

Notes

DOI: 10.1109/ICDM.2019.00140

See Table 2

The ostinato algorithm proposed in the paper finds the best radius in Ts. Intuitively, the radius is the minimum distance of a subsequence to encompass at least one nearest neighbor subsequence from all other time series. The best radius in Ts is the minimum radius amongst all radii. Some data sets might contain multiple subsequences which have the same optimal radius. The greedy Ostinato algorithm only finds one of them, which might not be the most central motif. The most central motif amongst the subsequences with the best radius is the one with the smallest mean distance to nearest neighbors in all other time series. To find this central motif it is necessary to search the subsequences with the best radius via stumpy.ostinato._get_central_motif
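The notion of a radius can be illustrated with an exhaustive search: for every subsequence, take its nearest-neighbor distance to each of the other time series, keep the largest of those as its radius, and report the subsequence with the smallest radius. This is a brute-force sketch with hypothetical helper names, not the Ostinato algorithm itself, and it skips the tie-breaking step that selects the most central motif.

```python
import numpy as np


def znorm(x):
    return (x - x.mean()) / x.std()


def nn_distance(q, T, m):
    """z-normalized distance from q to its nearest neighbor in T."""
    qz = znorm(q)
    return min(
        np.linalg.norm(qz - znorm(T[i : i + m])) for i in range(len(T) - m + 1)
    )


def naive_best_radius(Ts, m):
    best = (np.inf, -1, -1)
    for a, Ta in enumerate(Ts):
        for i in range(len(Ta) - m + 1):
            q = Ta[i : i + m]
            # The radius must reach a nearest neighbor in every other series
            radius = max(
                nn_distance(q, Tb, m) for b, Tb in enumerate(Ts) if b != a
            )
            if radius < best[0]:
                best = (radius, a, i)
    return best


Ts = [np.array([584., -11., 23., 79., 1001., 0., 19.]),
      np.array([600., -10., 23., 17.]),
      np.array([  1.,   9.,  6.,  0.])]
radius, Ts_idx, subseq_idx = naive_best_radius(Ts, m=3)
print(Ts_idx, subseq_idx)  # 0 4
```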

Examples

>>> import stumpy
>>> import numpy as np
>>> stumpy.ostinato(
...     [np.array([584., -11., 23., 79., 1001., 0., 19.]),
...      np.array([600., -10., 23., 17.]),
...      np.array([  1.,   9.,  6.,  0.])],
...     m=3)
(1.2370237678153826, 0, 4)

ostinatoed#

stumpy.ostinatoed(client, Ts, m, normalize=True, p=2.0, Ts_subseq_isconstant=None)[source]#

Find the z-normalized consensus motif of multiple time series with a distributed cluster

This is a wrapper around the vanilla version of the ostinato algorithm which finds the best radius and a helper function that finds the most central conserved motif.

Parameters:
  • client (client) – A Dask or Ray Distributed client. Setting up a distributed cluster is beyond the scope of this library. Please refer to the Dask or Ray Distributed documentation.

  • Ts (list) – A list of time series for which to find the most central consensus motif

  • m (int) – Window size

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • Ts_subseq_isconstant (list, default None) – A list containing, for each time series in Ts, its rolling window isconstant (i.e., which of its subsequences are constant).

Returns:

  • central_radius (float) – Radius of the most central consensus motif

  • central_Ts_idx (int) – The time series index in Ts which contains the most central consensus motif

  • central_subseq_idx (int) – The subsequence index within time series Ts[central_Ts_idx] that contains the most central consensus motif

See also

stumpy.ostinato

Find the z-normalized consensus motif of multiple time series

stumpy.gpu_ostinato

Find the z-normalized consensus motif of multiple time series with one or more GPU devices

Notes

DOI: 10.1109/ICDM.2019.00140

See Table 2

The ostinato algorithm proposed in the paper finds the best radius in Ts. Intuitively, the radius is the minimum distance of a subsequence to encompass at least one nearest neighbor subsequence from all other time series. The best radius in Ts is the minimum radius amongst all radii. Some data sets might contain multiple subsequences which have the same optimal radius. The greedy Ostinato algorithm only finds one of them, which might not be the most central motif. The most central motif amongst the subsequences with the best radius is the one with the smallest mean distance to nearest neighbors in all other time series. To find this central motif it is necessary to search the subsequences with the best radius via stumpy.ostinato._get_central_motif

Examples

>>> import stumpy
>>> import numpy as np
>>> from dask.distributed import Client
>>> if __name__ == "__main__":
...     with Client() as dask_client:
...         stumpy.ostinatoed(
...             dask_client,
...             [np.array([584., -11., 23., 79., 1001., 0., 19.]),
...              np.array([600., -10., 23., 17.]),
...              np.array([  1.,   9.,  6.,  0.])],
...             m=3)
(1.2370237678153826, 0, 4)

gpu_ostinato#

stumpy.gpu_ostinato(Ts, m, device_id=0, normalize=True, p=2.0, Ts_subseq_isconstant=None)#

Find the z-normalized consensus motif of multiple time series with one or more GPU devices

This is a wrapper around the vanilla version of the ostinato algorithm which finds the best radius and a helper function that finds the most central conserved motif.

Parameters:
  • Ts (list) – A list of time series for which to find the most central consensus motif

  • m (int) – Window size

  • device_id (int or list, default 0) – The (GPU) device number to use. The default value is 0. A list of valid device ids (int) may also be provided for parallel GPU-STUMP computation. A list of all valid device ids can be obtained by executing [device.id for device in numba.cuda.list_devices()].

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • Ts_subseq_isconstant (list, default None) – A list containing, for each time series in Ts, its rolling window isconstant (i.e., which of its subsequences are constant).

Returns:

  • central_radius (float) – Radius of the most central consensus motif

  • central_Ts_idx (int) – The time series index in Ts which contains the most central consensus motif

  • central_subseq_idx (int) – The subsequence index within time series Ts[central_Ts_idx] that contains the most central consensus motif

See also

stumpy.ostinato

Find the z-normalized consensus motif of multiple time series

stumpy.ostinatoed

Find the z-normalized consensus motif of multiple time series with a distributed cluster

Notes

DOI: 10.1109/ICDM.2019.00140

See Table 2

The ostinato algorithm proposed in the paper finds the best radius in Ts. Intuitively, the radius is the minimum distance of a subsequence to encompass at least one nearest neighbor subsequence from all other time series. The best radius in Ts is the minimum radius amongst all radii. Some data sets might contain multiple subsequences which have the same optimal radius. The greedy Ostinato algorithm only finds one of them, which might not be the most central motif. The most central motif amongst the subsequences with the best radius is the one with the smallest mean distance to nearest neighbors in all other time series. To find this central motif it is necessary to search the subsequences with the best radius via stumpy.ostinato._get_central_motif

Examples

>>> import stumpy
>>> import numpy as np
>>> from numba import cuda
>>> if __name__ == "__main__":
...     all_gpu_devices = [device.id for device in cuda.list_devices()]
...     stumpy.gpu_ostinato(
...         [np.array([584., -11., 23., 79., 1001., 0., 19.]),
...          np.array([600., -10., 23., 17.]),
...          np.array([  1.,   9.,  6.,  0.])],
...         m=3,
...         device_id=all_gpu_devices)
(1.2370237678153826, 0, 4)

mpdist#

stumpy.mpdist(T_A, T_B, m, percentage=0.05, k=None, normalize=True, p=2.0, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)[source]#

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series

The MPdist distance measure considers two time series to be similar if they share many subsequences, regardless of the order of matching subsequences. MPdist concatenates the output of an AB-join and a BA-join and returns the `k`th smallest value as the reported distance. Note that MPdist is a measure and not a metric: it does not obey the triangle inequality. The method is, however, highly scalable.
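
The concatenate-and-select step can be sketched in plain Python as below. This is only an illustration of the idea, not STUMPY's implementation: the helper name is hypothetical, the input values are made up rather than produced by real joins, and treating k as zero-based and clamping it to the valid range are assumptions.

```python
def kth_smallest_of_joins(P_AB, P_BA, k):
    """Hypothetical helper: concatenate two matrix profiles and return the
    k-th smallest value (k is zero-based here and clamped, by assumption)."""
    P_ABBA = sorted(list(P_AB) + list(P_BA))
    k = max(0, min(int(k), len(P_ABBA) - 1))
    return P_ABBA[k]


# Hypothetical matrix profile values, not the output of a real join
P_AB = [0.4, 0.1, 2.5]
P_BA = [0.3, 3.0, 0.2, 1.7]
mpdist_value = kth_smallest_of_joins(P_AB, P_BA, k=2)
print(mpdist_value)
```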

Parameters:
  • T_A (numpy.ndarray) – The first time series or sequence for which to compute the matrix profile

  • T_B (numpy.ndarray) – The second time series or sequence for which to compute the matrix profile

  • m (int) – Window size

  • percentage (float, default 0.05) – The percentage of distances that will be used to report mpdist. The value is between 0.0 and 1.0. This parameter is ignored when k is not None.

  • k (int, default None) – Specify the `k`th value in the concatenated matrix profiles to return. When k is not None, then the percentage parameter is ignored.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. This parameter is ignored when normalize == True.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
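
A user-defined isconstant function of the kind described above might look like the following sketch. The function name isconstant_within_tol and its tol argument are hypothetical, and the choice of "range at most tol" as the definition of constant is only one possible convention. It assumes NumPy 1.20 or later for sliding_window_view.

```python
import functools

import numpy as np


def isconstant_within_tol(a, w, tol=0.0):
    """Hypothetical helper: one boolean per length-w subsequence of the 1-D
    array a, True where the range (max - min) is at most tol."""
    windows = np.lib.stride_tricks.sliding_window_view(a, w)
    # Windows containing np.nan compare as False, so they are never flagged
    return (windows.max(axis=1) - windows.min(axis=1)) <= tol


# Curry the extra argument so that the callable only takes (a, w)
isconstant_func = functools.partial(isconstant_within_tol, tol=0.5)

T = np.array([1.0, 1.1, 1.2, 5.0, 5.0, 5.0])
mask = isconstant_func(T, 3)
print(mask)
```

The curried callable could then be supplied as T_A_subseq_isconstant or T_B_subseq_isconstant, since it only takes the two required arguments.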

Returns:

MPdist – The matrix profile distance

Return type:

float

See also

mpdisted

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with a distributed dask/ray cluster

gpu_mpdist

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with one or more GPU devices

Notes

DOI: 10.1109/ICDM.2018.00119

See Section III

Examples

>>> import stumpy
>>> import numpy as np
>>> stumpy.mpdist(
...     np.array([-11.1, 23.4, 79.5, 1001.0]),
...     np.array([584., -11., 23., 79., 1001., 0., -19.]),
...     m=3)
0.00019935236191097894

mpdisted#

stumpy.mpdisted(client, T_A, T_B, m, percentage=0.05, k=None, normalize=True, p=2.0, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)[source]#

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with a distributed dask/ray cluster

The MPdist distance measure considers two time series to be similar if they share many subsequences, regardless of the order of matching subsequences. MPdist concatenates the output of an AB-join and a BA-join and returns the `k`th smallest value as the reported distance. Note that MPdist is a measure and not a metric: it does not obey the triangle inequality. The method is, however, highly scalable.

Parameters:
  • client (client) – A Dask or Ray Distributed client. Setting up a distributed cluster is beyond the scope of this library. Please refer to the Dask or Ray Distributed documentation.

  • T_A (numpy.ndarray) – The first time series or sequence for which to compute the matrix profile

  • T_B (numpy.ndarray) – The second time series or sequence for which to compute the matrix profile

  • m (int) – Window size

  • percentage (float, default 0.05) – The percentage of distances that will be used to report mpdist. The value is between 0.0 and 1.0. This parameter is ignored when k is not None.

  • k (int, default None) – Specify the `k`th value in the concatenated matrix profiles to return. When k is not None, then the percentage parameter is ignored.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. This parameter is ignored when normalize == True.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

Returns:

MPdist – The matrix profile distance

Return type:

float

See also

mpdist

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series

gpu_mpdist

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with one or more GPU devices

Notes

DOI: 10.1109/ICDM.2018.00119

See Section III

Examples

>>> import stumpy
>>> import numpy as np
>>> from dask.distributed import Client
>>> if __name__ == "__main__":
...     with Client() as dask_client:
...         stumpy.mpdisted(
...             dask_client,
...             np.array([-11.1, 23.4, 79.5, 1001.0]),
...             np.array([584., -11., 23., 79., 1001., 0., -19.]),
...             m=3)
0.00019935236191097894

gpu_mpdist#

stumpy.gpu_mpdist(T_A, T_B, m, percentage=0.05, k=None, device_id=0, normalize=True, p=2.0, T_A_subseq_isconstant=None, T_B_subseq_isconstant=None)#

Compute the z-normalized matrix profile distance (MPdist) measure between any two time series with one or more GPU devices

The MPdist distance measure considers two time series to be similar if they share many subsequences, regardless of the order of matching subsequences. MPdist concatenates and sorts the output of an AB-join and a BA-join and returns the `k`th smallest value as the reported distance. Note that MPdist is a measure and not a metric: it does not obey the triangle inequality. The method is, however, highly scalable.

Parameters:
  • T_A (numpy.ndarray) – The first time series or sequence for which to compute the matrix profile

  • T_B (numpy.ndarray) – The second time series or sequence for which to compute the matrix profile

  • m (int) – Window size

  • percentage (float, default 0.05) – The percentage of distances that will be used to report mpdist. The value is between 0.0 and 1.0. This parameter is ignored when k is not None.

  • k (int, default None) – Specify the `k`th value in the concatenated matrix profiles to return. When k is not None, then the percentage parameter is ignored.

  • device_id (int or list, default 0) – The (GPU) device number to use. The default value is 0. A list of valid device ids (int) may also be provided for parallel GPU-STUMP computation. A list of all valid device ids can be obtained by executing [device.id for device in numba.cuda.list_devices()].

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_A_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_A is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_A is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • T_B_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T_B is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T_B is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
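
To make the effect of normalize=True more concrete, the sketch below z-normalizes two subsequences in plain Python before taking their Euclidean distance. It is an illustration only, with hypothetical helper names, and it is not how STUMPY computes distances internally. Note how a constant subsequence has zero standard deviation, which is why such subsequences need special handling.

```python
import math


def z_normalize(x):
    """Hypothetical helper: subtract the mean and divide by the population
    standard deviation."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0.0:
        raise ValueError("constant subsequence: standard deviation is zero")
    return [(v - mean) / std for v in x]


def z_norm_euclidean(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    za, zb = z_normalize(a), z_normalize(b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(za, zb)))


# Same shape at a different scale and offset: the distance is (close to) zero
d_same_shape = z_norm_euclidean([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# Opposite trend: the distance is clearly non-zero
d_opposite = z_norm_euclidean([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(d_same_shape, d_opposite)
```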

Returns:

MPdist – The matrix profile distance

Return type:

float

Notes

DOI: 10.1109/ICDM.2018.00119

See Section III

Examples

>>> import stumpy
>>> import numpy as np
>>> from numba import cuda
>>> if __name__ == "__main__":
...     all_gpu_devices = [device.id for device in cuda.list_devices()]
...     stumpy.gpu_mpdist(
...         np.array([-11.1, 23.4, 79.5, 1001.0]),
...         np.array([584., -11., 23., 79., 1001., 0., -19.]),
...         m=3,
...         device_id=all_gpu_devices)
0.00019935236191097894

motifs#

stumpy.motifs(T, P, min_neighbors=1, max_distance=None, cutoff=None, max_matches=10, max_motifs=1, atol=1e-08, normalize=True, p=2.0, T_subseq_isconstant=None)[source]#

Discover the top motifs for time series T

A subsequence, Q, becomes a candidate motif if there are at least min_neighbors other subsequence matches in T (outside the exclusion zone) with a distance less than or equal to max_distance.

Note that, in the best case scenario, the returned arrays would have shape (max_motifs, max_matches) and contain all finite values. However, in reality, many conditions (see below) need to be satisfied in order for this to be true. Any truncation in the number of rows (i.e., motifs) may be the result of insufficient candidate motifs with matches greater than or equal to min_neighbors or that the matrix profile value for the candidate motif was larger than cutoff. Similarly, any truncation in the number of columns (i.e., matches) may be the result of insufficient matches being found with distances (to their corresponding candidate motif) that are equal to or less than max_distance. Only motifs and matches that satisfy all of these constraints will be returned.

If you must return a shape of (max_motifs, max_matches), then you may consider specifying a smaller min_neighbors, a larger max_distance, and/or a larger cutoff. For example, while it is ill advised, setting min_neighbors=1, max_distance=np.inf, and cutoff=np.inf will ensure that the shape of the output arrays will be (max_motifs, max_matches). However, given the lack of constraints, the quality of each motif and the quality of each match may be drastically different. Setting appropriate conditions will help ensure appropriately constrained results that may be easier to interpret.

Parameters:
  • T (numpy.ndarray) – The time series or sequence

  • P (numpy.ndarray) – The (1-dimensional) matrix profile of T. In the case where the matrix profile was computed with k > 1 (i.e., top-k nearest neighbors), you must summarize the top-k nearest-neighbor distances for each subsequence into a single value (e.g., np.mean, np.min, etc) and then use that derived value as your P.

  • min_neighbors (int, default 1) – The minimum number of similar matches a subsequence needs to have in order to be considered a motif. This defaults to 1, which means that a subsequence must have at least one similar match in order to be considered a motif.

  • max_distance (float or function, default None) – For a candidate motif, Q, and a non-trivial subsequence, S, max_distance is the maximum distance allowed between Q and S so that S is considered a match of Q. If max_distance is a function, then it must be a function that accepts a single parameter, D, in its function signature, which is the distance profile between Q and T. If None, this defaults to np.nanmax([np.nanmean(D) - 2.0 * np.nanstd(D), np.nanmin(D)]).

  • cutoff (float, default None) – The largest matrix profile value (distance) that a candidate motif is allowed to have. If None, this defaults to np.nanmax([np.nanmean(P) - 2.0 * np.nanstd(P), np.nanmin(P)])

  • max_matches (int, default 10) – The maximum number of similar matches of a motif representative to be returned. The resulting matches are sorted by distance, so a value of 10 means that the indices of the 10 most similar subsequences are returned. If None, all matches within max_distance of the motif representative will be returned. Note that the first match is always the self-match/trivial-match for each motif.

  • max_motifs (int, default 1) – The maximum number of motifs to return

  • atol (float, default 1e-8) – The absolute tolerance parameter. This value will be added to max_distance when comparing distances between subsequences.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence in T is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
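
When max_distance is given as a function, it receives the distance profile D and returns a single threshold. The sketch below first mirrors the documented default expression and then shows a hypothetical, more permissive variant. Both function names and the sample distance profile are illustrative assumptions.

```python
import numpy as np


def default_like_max_distance(D):
    """Mirror the documented default: the larger of (mean - 2 * std) and the
    minimum of the distance profile D, ignoring NaN values."""
    return np.nanmax([np.nanmean(D) - 2.0 * np.nanstd(D), np.nanmin(D)])


def lenient_max_distance(D, n_std=1.0):
    """Hypothetical variant that subtracts fewer standard deviations; it is
    still callable with D alone because n_std has a default."""
    return np.nanmax([np.nanmean(D) - n_std * np.nanstd(D), np.nanmin(D)])


# Hypothetical distance profile with one undefined entry
D = np.array([0.0, 1.0, 2.0, 3.0, 4.0, np.nan])
strict = default_like_max_distance(D)
lenient = lenient_max_distance(D)
print(strict, lenient)
```

Because the expression never falls below the smallest distance in D, at least the closest match stays within the threshold.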

Returns:

  • motif_distances (numpy.ndarray) – The distances corresponding to a set of subsequence matches for each motif. Note that the first column always corresponds to the distance for the self-match/trivial-match for each motif.

  • motif_indices (numpy.ndarray) – The indices corresponding to a set of subsequences matches for each motif. Note that the first column always corresponds to the index for the self-match/trivial-match for each motif.

See also

stumpy.match

Find all matches of a query Q in a time series T

stumpy.mmotifs

Discover the top motifs for the multi-dimensional time series T

stumpy.stump

Compute the z-normalized matrix profile

stumpy.stumped

Compute the z-normalized matrix profile with a distributed dask/ray cluster

stumpy.gpu_stump

Compute the z-normalized matrix profile with one or more GPU devices

stumpy.scrump

Compute an approximate z-normalized matrix profile

Examples

>>> import stumpy
>>> import numpy as np
>>> mp = stumpy.stump(np.array([584., -11., 23., 79., 1001., 0., -19.]), m=3)
>>> stumpy.motifs(
...     np.array([584., -11., 23., 79., 1001., 0., -19.]),
...     mp[:, 0],
...     max_distance=2.0)
(array([[0.        , 0.11633857]]), array([[0, 4]]))

match#

stumpy.match(Q, T, M_T=None, Σ_T=None, max_distance=None, max_matches=None, atol=1e-08, query_idx=None, normalize=True, p=2.0, T_subseq_isfinite=None, T_subseq_isconstant=None, Q_subseq_isconstant=None)[source]#

Find all matches of a query Q in a time series T

The indices of subsequences whose distances to Q are less than or equal to max_distance, sorted by distance (lowest to highest). Around each occurrence an exclusion zone is applied before searching for the next.
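
The search procedure described above can be sketched as a greedy loop over a distance profile. This plain-Python version is only an illustration: the helper name is hypothetical, excl_zone is passed in as a plain integer (how STUMPY derives it from the window size is not covered here), and the profile is assumed to contain no NaN values.

```python
import math


def greedy_matches(D, max_distance, excl_zone, max_matches=None):
    """Hypothetical helper: repeatedly take the smallest remaining distance
    that is within max_distance, then mask its neighbors within excl_zone."""
    D = list(D)  # work on a copy so the caller's profile is untouched
    matches = []
    while D and (max_matches is None or len(matches) < max_matches):
        idx = min(range(len(D)), key=D.__getitem__)
        if math.isinf(D[idx]) or D[idx] > max_distance:
            break  # everything left is masked or too far away
        matches.append((D[idx], idx))
        lo, hi = max(0, idx - excl_zone), min(len(D), idx + excl_zone + 1)
        for i in range(lo, hi):
            D[i] = math.inf  # exclude trivial matches around this hit
    return matches


# Hypothetical distance profile of a query against a time series
D = [0.9, 0.1, 0.2, 3.0, 0.4, 0.5, 2.0]
found = greedy_matches(D, max_distance=1.0, excl_zone=1)
print(found)
```

Indices 0, 2, and 5 are within max_distance but are skipped because they fall inside the exclusion zone of a better match.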

Parameters:
  • Q (numpy.ndarray) – The query sequence. It doesn’t have to be a subsequence of T

  • T (numpy.ndarray) – The time series of interest

  • M_T (numpy.ndarray, default None) – Sliding mean of time series, T

  • Σ_T (numpy.ndarray, default None) – Sliding standard deviation of time series, T

  • max_distance (float or function, default None) – Maximum distance between Q and a subsequence S for S to be considered a match. If a function, then it has to be a function of one argument D, which will be the distance profile of Q with T (a 1D numpy array of size n-m+1). If None, this defaults to np.nanmax([np.nanmean(D) - 2 * np.nanstd(D), np.nanmin(D)]) (i.e. at least the closest match will be returned).

  • max_matches (int, default None) – The maximum number of similar occurrences to be returned. The resulting occurrences are sorted by distance, so a value of 10 means that the indices of the 10 most similar subsequences are returned. If None, then all occurrences are returned.

  • atol (float, default 1e-8) – The absolute tolerance parameter. This value will be added to max_distance when comparing distances between subsequences.

  • query_idx (int, default None) – This is the index position along the time series, T, where the query subsequence, Q, is located. query_idx should only be used when the matrix profile is a self-join and should be set to None for matrix profiles computed from AB-joins. If query_idx is set to a specific integer value, then this will help ensure that the self-match will be returned first.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isfinite (numpy.ndarray, default None) – A boolean array that indicates whether a subsequence in T contains a np.nan/np.inf value (False). This parameter is ignored when normalize == True.

  • T_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence (of length Q) in T is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

  • Q_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array (of size 1) that indicates whether Q is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in Q is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.

Returns:

out – The first column consists of distances of subsequences of T whose distances to Q are less than or equal to max_distance, sorted by distance (lowest to highest). The second column consists of the corresponding indices in T.

Return type:

numpy.ndarray

See also

stumpy.motifs

Discover the top motifs for time series T

stumpy.mmotifs

Discover the top motifs for the multi-dimensional time series T

stumpy.stump

Compute the z-normalized matrix profile

stumpy.stumped

Compute the z-normalized matrix profile with a distributed dask/ray cluster

stumpy.gpu_stump

Compute the z-normalized matrix profile with one or more GPU devices

stumpy.scrump

Compute an approximate z-normalized matrix profile

Examples

>>> import stumpy
>>> import numpy as np
>>> stumpy.match(
...     np.array([-11.1, 23.4, 79.5, 1001.0]),
...     np.array([584., -11., 23., 79., 1001., 0., -19.])
...     )
array([[0.0011129739290248121, 1]], dtype=object)

mmotifs#

stumpy.mmotifs(T, P, I, min_neighbors=1, max_distance=None, cutoffs=None, max_matches=10, max_motifs=1, atol=1e-08, k=None, include=None, normalize=True, p=2.0, T_subseq_isconstant=None)[source]#

Discover the top motifs for the multi-dimensional time series T

Parameters:
  • T (numpy.ndarray) – The multi-dimensional time series or sequence

  • P (numpy.ndarray) – Multi-dimensional Matrix Profile of T

  • I (numpy.ndarray) – Multi-dimensional Matrix Profile indices

  • min_neighbors (int, default 1) – The minimum number of similar matches a subsequence needs to have in order to be considered a motif. This defaults to 1, which means that a subsequence must have at least one similar match in order to be considered a motif.

  • max_distance (float, default None) – Maximal distance that is allowed between a query subsequence (a candidate motif) and all subsequences in T to be considered as a match. If None, this defaults to np.nanmax([np.nanmean(D) - 2 * np.nanstd(D), np.nanmin(D)]) (i.e. at least the closest match will be returned).

  • cutoffs (numpy.ndarray or float, default None) – The largest matrix profile value (distance) for each dimension of the multidimensional matrix profile that a multidimensional candidate motif is allowed to have. If cutoffs is a scalar value, then this value will be applied to every dimension.

  • max_matches (int, default 10) – The maximum number of similar matches (nearest neighbors) to return for each motif. The first match is always the self/trivial-match for each motif.

  • max_motifs (int, default 1) – The maximum number of motifs to return

  • atol (float, default 1e-8) – The absolute tolerance parameter. This value will be added to max_distance when comparing distances between subsequences.

  • k (int, default None) – The number of dimensions (k + 1) required for discovering all motifs. This value is available for doing guided search or, together with include, for constrained search. If k is None, then this will automatically be computed for each motif using MDL (unconstrained search).

  • include (numpy.ndarray, default None) – A list of (zero based) indices corresponding to the dimensions in T that must be included in the constrained multidimensional motif search.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant (numpy.ndarray, function, or list, default None) – A parameter that is used to show whether a subsequence of a time series in T is constant (True) or not. T_subseq_isconstant can be a 2D boolean numpy.ndarray or a function that can be applied to each time series in T. Alternatively, for maximum flexibility, a list (with length equal to the total number of time series) may also be used. In this case, T_subseq_isconstant[i] corresponds to the i-th time series T[i] and each element in the list can either be a 1D boolean numpy.ndarray, a function, or None.

Returns:

  • motif_distances (numpy.ndarray) – The distances corresponding to a set of subsequence matches for each motif.

  • motif_indices (numpy.ndarray) – The indices corresponding to a set of subsequences matches for each motif.

  • motif_subspaces (list) – A list consisting of arrays that contain the k-dimensional subspace for each motif.

  • motif_mdls (list) – A list consisting of arrays that contain the mdl results for finding the dimension of each motif

See also

stumpy.motifs

Find the top motifs for time series T

stumpy.match

Find all matches of a query Q in a time series T

stumpy.mstump

Compute the multi-dimensional z-normalized matrix profile

stumpy.mstumped

Compute the multi-dimensional z-normalized matrix profile with a distributed dask/ray cluster

stumpy.subspace

Compute the k-dimensional matrix profile subspace for a given subsequence index and its nearest neighbor index

stumpy.mdl

Compute the number of bits needed to compress one array with another using the minimum description length (MDL)

Notes

DOI: 10.1109/ICDM.2017.66

For more information on include and search types, see Section IV D and IV E

Examples

>>> import stumpy
>>> import numpy as np
>>> mps, indices = stumpy.mstump(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     m=3)
>>> stumpy.mmotifs(
...     np.array([[584., -11., 23., 79., 1001., 0., -19.],
...               [  1.,   2.,  4.,  8.,   16., 0.,  32.]]),
...     mps,
...     indices)
(array([[4.47034836e-08, 4.47034836e-08]]), array([[0, 2]]), [array([1])],
 [array([ 80.      , 111.509775])])

snippets#

stumpy.snippets(T, m, k, percentage=1.0, s=None, mpdist_percentage=0.05, mpdist_k=None, normalize=True, p=2.0, mpdist_T_subseq_isconstant=None)[source]#

Identify the top k snippets that best represent the time series, T

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which to find the snippets

  • m (int) – The snippet window size

  • k (int) – The desired number of snippets

  • percentage (float, default 1.0) – With the length of each non-overlapping subsequence, S[i], set to m, this is the percentage of S[i] (i.e., percentage * m) to set s (the sub-subsequence length) to. When percentage == 1.0, then the full length of S[i] is used to compute the mpdist_vect. When percentage < 1.0, then a shorter sub-subsequence length of s = min(math.ceil(percentage * m), m) from each S[i] is used to compute mpdist_vect. When s is not None, then the percentage parameter is ignored.

  • s (int, default None) – With the length of each non-overlapping subsequence, S[i], set to m, this is essentially the sub-subsequence length (i.e., a shorter part of S[i]). When s == m, then the full length of S[i] is used to compute the mpdist_vect. When s < m, then shorter subsequences with length s from each S[i] are used to compute mpdist_vect. When s is not None, then the percentage parameter is ignored.

  • mpdist_percentage (float, default 0.05) – The percentage of distances that will be used to report mpdist. The value is between 0.0 and 1.0.

  • mpdist_k (int, default None) – Specify the `k`th value in the concatenated matrix profiles to return. When mpdist_k is not None, then the mpdist_percentage parameter is ignored.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • mpdist_T_subseq_isconstant (numpy.ndarray or function, default None) – A boolean array that indicates whether a subsequence (of length s) in T is constant (True). Alternatively, a custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
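
The interplay between percentage and s described above can be written out as a small plain-Python sketch. The helper name sub_subsequence_length is hypothetical, and whether STUMPY additionally validates or clamps an explicitly supplied s is not shown here.

```python
import math


def sub_subsequence_length(m, percentage=1.0, s=None):
    """Hypothetical helper: an explicit s takes precedence; otherwise
    s = min(ceil(percentage * m), m)."""
    if s is not None:
        return s
    return min(math.ceil(percentage * m), m)


full = sub_subsequence_length(m=100)                            # full length
shorter = sub_subsequence_length(m=100, percentage=0.25)        # shorter window
explicit = sub_subsequence_length(m=100, percentage=0.25, s=40) # s wins
print(full, shorter, explicit)
```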

Returns:

  • snippets (numpy.ndarray) – The top k snippets

  • snippets_indices (numpy.ndarray) – The index locations for each of the top k snippets

  • snippets_profiles (numpy.ndarray) – The MPdist profiles for each of the top k snippets

  • snippets_fractions (numpy.ndarray) – The fraction of data that each of the top k snippets represents

  • snippets_areas (numpy.ndarray) – The area under the curve corresponding to each profile for each of the top k snippets

  • snippets_regimes (numpy.ndarray) – The index slices corresponding to the set of regimes for each of the top k snippets. The first column is the (zero-based) snippet index while the second and third columns correspond to the (inclusive) regime start indices and the (exclusive) regime stop indices, respectively.
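
One way to consume snippets_regimes is to group its rows by snippet index and turn each row into a slice. The sketch below uses plain Python lists with the regime values copied from the stumpy.snippets example in this section. The variable names are illustrative, and deriving a coverage ratio from the regime lengths is an assumption about how the columns relate, not a statement of how STUMPY computes snippets_fractions.

```python
# Each row is (snippet index, inclusive start, exclusive stop)
snippets_regimes = [
    [0, 0, 1],
    [0, 2, 3],
    [0, 4, 5],
    [1, 1, 2],
    [1, 3, 4],
]

regime_slices = {}  # snippet index -> list of slice objects
covered = {}        # snippet index -> number of covered positions
for snippet_idx, start, stop in snippets_regimes:
    regime_slices.setdefault(snippet_idx, []).append(slice(start, stop))
    covered[snippet_idx] = covered.get(snippet_idx, 0) + (stop - start)

total = sum(covered.values())
coverage = {idx: n / total for idx, n in covered.items()}
print(regime_slices)
print(coverage)
```

For this particular example, the coverage values agree with the snippets_fractions reported in the example output.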

Notes

DOI: 10.1109/ICBK.2018.00058

See Table I

Examples

>>> import stumpy
>>> import numpy as np
>>> stumpy.snippets(np.array([584., -11., 23., 79., 1001., 0., -19.]), m=3, k=2)
(array([[ 584.,  -11.,   23.],
        [  79., 1001.,    0.]]),
 array([0, 3]),
 array([[0.        , 3.2452632 , 3.00009263, 2.982409  , 0.11633857],
        [2.982409  , 2.69407392, 3.01719586, 0.        , 2.92154586]]),
 array([0.6, 0.4]),
 array([9.3441034 , 5.81050512]),
 array([[0, 0, 1],
        [0, 2, 3],
        [0, 4, 5],
        [1, 1, 2],
        [1, 3, 4]]))

stimp#

stumpy.stimp(T, min_m=3, max_m=None, step=1, percentage=0.01, pre_scrump=True, normalize=True, p=2.0, T_subseq_isconstant_func=None)[source]#

Compute the Pan Matrix Profile

This is based on the SKIMP algorithm.

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which to compute the pan matrix profile

  • min_m (int, default 3) – The starting (or minimum) subsequence window size for which a matrix profile may be computed

  • max_m (int, default None) – The stopping (or maximum) subsequence window size for which a matrix profile may be computed. When max_m = None, this is set to the maximum allowable subsequence window size

  • step (int, default 1) – The step between subsequence window sizes

  • percentage (float, default 0.01) – The percentage of the full matrix profile to compute for each subsequence window size, expressed as a value between 0.0 and 1.0. When percentage < 1.0, the approximate scrump algorithm is used. Otherwise, the exact matrix profile is computed with the stump algorithm.

  • pre_scrump (bool, default True) – A flag for whether or not to perform the PreSCRIMP calculation prior to computing SCRIMP. If set to True, this is equivalent to computing SCRIMP++. This parameter is ignored when percentage = 1.0.

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. This parameter is ignored when normalize == True.

  • T_subseq_isconstant_func (function, default None) – A custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
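
The following is a minimal sketch of a user-defined function suitable for T_subseq_isconstant_func, assuming a simple "range within tolerance" definition of constant. The function name isconstant_within_tol and its tol argument are hypothetical and chosen only for illustration. The extra argument is curried with functools.partial so that the resulting callable takes only a and w, as the parameter description requires.

```python
import functools

import numpy as np


def isconstant_within_tol(a, w, tol=0.0):
    """Flag each length-w subsequence of `a` whose range (max - min) is <= tol.

    Hypothetical helper for illustration; not part of STUMPY.
    """
    windows = np.lib.stride_tricks.sliding_window_view(a, w)
    return (windows.max(axis=1) - windows.min(axis=1)) <= tol


# Curry the extra `tol` argument so the callable accepts only (a, w)
isconstant_func = functools.partial(isconstant_within_tol, tol=0.5)

T = np.array([1.0, 1.2, 1.1, 5.0, 5.0, 5.0, 9.0])
mask = isconstant_func(T, 3)  # one boolean per subsequence of length 3
print(mask)
```

Going by the signature above, the curried callable would then be passed as stumpy.stimp(T, T_subseq_isconstant_func=isconstant_func).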

stumpy.PAN_#

The transformed (i.e., normalized, contrasted, binarized, and repeated) pan matrix profile

Type:

numpy.ndarray

stumpy.M_#

The full list of (breadth-first-search (level) ordered) subsequence window sizes

Type:

numpy.ndarray

update():

Compute the next matrix profile using the next available (breadth-first-search (level) ordered) subsequence window size and update the pan matrix profile
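
To give a feel for what a breadth-first-search (level) ordering of window sizes looks like, here is a small standard-library sketch. It builds an implicit balanced binary search tree over the sorted window sizes and visits it level by level. The helper name bfs_order and the midpoint convention are assumptions made for illustration, so the exact ordering in M_ may differ.

```python
from collections import deque


def bfs_order(values):
    """Return `values` in level order of an implicit balanced binary search tree.

    Illustrative helper; the midpoint rule here is an assumption.
    """
    values = sorted(values)
    ordered = []
    queue = deque([(0, len(values))])  # half-open index ranges [lo, hi)
    while queue:
        lo, hi = queue.popleft()
        if lo >= hi:
            continue  # empty range, nothing to visit
        mid = (lo + hi) // 2
        ordered.append(values[mid])   # visit the middle element first
        queue.append((lo, mid))       # then the left half
        queue.append((mid + 1, hi))   # and the right half
    return ordered


# Window sizes for min_m=3, max_m=9, step=1
window_sizes = list(range(3, 10, 1))
ordered = bfs_order(window_sizes)
print(ordered)  # → [6, 4, 8, 3, 5, 7, 9]
```

An ordering of this kind covers the range of window sizes coarsely first and fills in the gaps later, so each call to update() adds a window size that is far from the ones already computed.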

See also

stumpy.stimped

Compute the Pan Matrix Profile with a distributed dask/ray cluster

stumpy.gpu_stimp

Compute the Pan Matrix Profile with one or more GPU devices

Notes

DOI: 10.1109/ICBK.2019.00031

See Table 2

Examples

>>> import stumpy
>>> import numpy as np
>>> pmp = stumpy.stimp(np.array([584., -11., 23., 79., 1001., 0., -19.]))
>>> pmp.update()
>>> pmp.PAN_
array([[0., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1.]])

stimped#

stumpy.stimped(client, T, min_m=3, max_m=None, step=1, normalize=True, p=2.0, T_subseq_isconstant_func=None)[source]#

Compute the Pan Matrix Profile with a distributed dask/ray cluster

This is based on the SKIMP algorithm.

Parameters:
  • client (client) – A Dask or Ray Distributed client. Setting up a distributed cluster is beyond the scope of this library. Please refer to the Dask or Ray Distributed documentation.

  • T (numpy.ndarray) – The time series or sequence for which to compute the pan matrix profile

  • min_m (int, default 3) – The starting (or minimum) subsequence window size for which a matrix profile may be computed

  • max_m (int, default None) – The stopping (or maximum) subsequence window size for which a matrix profile may be computed. When max_m = None, this is set to the maximum allowable subsequence window size

  • step (int, default 1) – The step between subsequence window sizes

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. This parameter is ignored when normalize == True.

  • T_subseq_isconstant_func (function, default None) – A custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
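
As a hedged illustration of what normalize=True implies, the NumPy sketch below z-normalizes two subsequences before taking their Euclidean distance. The helper names znorm and znorm_euclidean are hypothetical, and the use of the population standard deviation is an assumption of this sketch rather than a statement about STUMPY's internals. A constant subsequence would have zero standard deviation and is not handled here.

```python
import numpy as np


def znorm(x):
    """Z-normalize a 1-D array (zero mean, unit population standard deviation)."""
    return (x - x.mean()) / x.std()


def znorm_euclidean(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    return np.linalg.norm(znorm(x) - znorm(y))


a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])  # same shape as `a`, different scale

d_norm = znorm_euclidean(a, b)  # scale and offset are removed
d_raw = np.linalg.norm(a - b)   # non-normalized Euclidean distance
print(d_norm, d_raw)
```

The z-normalized distance compares shape only, which is why two subsequences that differ by scale or offset are treated as matches when normalize=True.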

stumpy.PAN_#

The transformed (i.e., normalized, contrasted, binarized, and repeated) pan matrix profile

Type:

numpy.ndarray

stumpy.M_#

The full list of (breadth-first-search (level) ordered) subsequence window sizes

Type:

numpy.ndarray

update():

Compute the next matrix profile using the next available (breadth-first-search (level) ordered) subsequence window size and update the pan matrix profile

See also

stumpy.stimp

Compute the Pan Matrix Profile

stumpy.gpu_stimp

Compute the Pan Matrix Profile with one or more GPU devices

Notes

DOI: 10.1109/ICBK.2019.00031

See Table 2

Examples

>>> import stumpy
>>> import numpy as np
>>> from dask.distributed import Client
>>> if __name__ == "__main__":
...     with Client() as dask_client:
...         pmp = stumpy.stimped(
...             dask_client,
...             np.array([584., -11., 23., 79., 1001., 0., -19.]))
...         pmp.update()
...         pmp.PAN_
array([[0., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1.]])

gpu_stimp#

stumpy.gpu_stimp(T, min_m=3, max_m=None, step=1, device_id=0, normalize=True, p=2.0, T_subseq_isconstant_func=None)#

Compute the Pan Matrix Profile with one or more GPU devices

This is based on the SKIMP algorithm.

Parameters:
  • T (numpy.ndarray) – The time series or sequence for which to compute the pan matrix profile

  • min_m (int, default 3) – The starting (or minimum) subsequence window size for which a matrix profile may be computed

  • max_m (int, default None) – The stopping (or maximum) subsequence window size for which a matrix profile may be computed. When max_m = None, this is set to the maximum allowable subsequence window size

  • step (int, default 1) – The step between subsequence window sizes

  • device_id (int or list, default 0) – The (GPU) device number to use. The default value is 0. A list of valid device ids (int) may also be provided for parallel GPU-STUMP computation. A list of all valid device ids can be obtained by executing [device.id for device in numba.cuda.list_devices()].

  • normalize (bool, default True) – When set to True, this z-normalizes subsequences prior to computing distances. Otherwise, this function gets re-routed to its complementary non-normalized equivalent set in the @core.non_normalized function decorator.

  • p (float, default 2.0) – The p-norm to apply for computing the Minkowski distance. Minkowski distance is typically used with p being 1 or 2, which correspond to the Manhattan distance and the Euclidean distance, respectively. This parameter is ignored when normalize == True.

  • T_subseq_isconstant_func (function, default None) – A custom, user-defined function that returns a boolean array that indicates whether a subsequence in T is constant (True). The function must only take two arguments, a, a 1-D array, and w, the window size, while additional arguments may be specified by currying the user-defined function using functools.partial. Any subsequence with at least one np.nan/np.inf will automatically have its corresponding value set to False in this boolean array.
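
For the p parameter, the following NumPy sketch shows the Minkowski distance between two subsequences and how p = 1 and p = 2 reduce to the Manhattan and Euclidean distances. The helper name minkowski is hypothetical and is not a STUMPY function. The two arrays are the snippets from the stumpy.snippets example earlier on this page.

```python
import numpy as np


def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length 1-D arrays."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)


a = np.array([584.0, -11.0, 23.0])
b = np.array([79.0, 1001.0, 0.0])

manhattan = minkowski(a, b, p=1.0)  # 505 + 1012 + 23
euclidean = minkowski(a, b, p=2.0)  # same as np.linalg.norm(a - b)
print(manhattan, euclidean)
```

These non-normalized distances only apply when normalize=False, since p is ignored otherwise.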

stumpy.PAN_#

The transformed (i.e., normalized, contrasted, binarized, and repeated) pan matrix profile

Type:

numpy.ndarray

stumpy.M_#

The full list of (breadth-first-search (level) ordered) subsequence window sizes

Type:

numpy.ndarray

update():

Compute the next matrix profile using the next available (breadth-first-search (level) ordered) subsequence window size and update the pan matrix profile

See also

stumpy.stimp

Compute the Pan Matrix Profile

stumpy.stimped

Compute the Pan Matrix Profile with a distributed dask/ray cluster

Notes

DOI: 10.1109/ICBK.2019.00031

See Table 2

Examples

>>> import stumpy
>>> import numpy as np
>>> from numba import cuda
>>> if __name__ == "__main__":
...     all_gpu_devices = [device.id for device in cuda.list_devices()]
...     pmp = stumpy.gpu_stimp(
...         np.array([584., -11., 23., 79., 1001., 0., -19.]),
...         device_id=all_gpu_devices)
...     pmp.update()
...     pmp.PAN_
array([[0., 1., 1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1., 1., 1.]])