STUMPY documentation
Background & Motivation
The following video from May 2019 provides the background and motivation for developing and open sourcing STUMPY:
Installation
Using conda/pip
Conda install (preferred):
conda install -c conda-forge stumpy
PyPI install, with pip:
python -m pip install stumpy
From source
To install stumpy from source, you'll need to install the dependencies below. For maximum performance, it is recommended that you install all dependencies using conda:
conda install -c conda-forge -y numpy scipy numba
Alternatively, but with lower performance, you can also install these dependencies using the requirements.txt file in the root of this repository:
python -m pip install -r requirements.txt
Once the dependencies are installed (stay inside of the stumpy directory), execute:
python -m pip install .
STUMPY API
Overview
stumpy.stump
Compute the matrix profile with parallelized STOMP
stumpy.stumped
Compute the matrix profile with parallelized and distributed STOMPopt with Pearson correlations.
stumpy.gpu_stump
Compute the matrix profile with GPU-STOMP
stumpy.scrump
Compute the approximate matrix profile with the parallelized SCRIMP algorithm.
stumpy.stumpi
Compute an incremental matrix profile for streaming data.
stumpy.mstump
Compute the multi-dimensional matrix profile with parallelized mSTOMP
stumpy.mstumped
Compute the multi-dimensional matrix profile with parallelized and distributed mSTOMP
stumpy.atsc
Compute the anchored time series chain (ATSC).
stumpy.allc
Compute the all-chain set (ALLC).
stumpy.fluss
Compute the Fast Low-cost Unipotent Semantic Segmentation (FLUSS) for static data.
stumpy.floss
Compute the Fast Low-cost Online Semantic Segmentation (FLOSS) for streaming data.
stump

stumpy.stump(T_A, m, T_B=None, ignore_trivial=True)[source]
Compute the matrix profile with parallelized STOMP
This is a convenience wrapper around the Numba JIT-compiled parallelized _stump function which computes the matrix profile according to STOMPopt with Pearson correlations.
Parameters:
 T_A (ndarray) – The time series or sequence for which to compute the matrix profile
 m (int) – Window size
 T_B (ndarray) – The time series or sequence that contains your query subsequences of interest. Default is None which corresponds to a self-join.
 ignore_trivial (bool) – Set to True if this is a self-join. Otherwise, for an AB-join, set this to False. Default is True.
Returns: out – The first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.
Return type: ndarray
Notes
DOI: 10.1007/s10115-017-1138-x
See Section 4.5
The above reference outlines a general approach for traversing the distance matrix in a diagonal fashion rather than in a row-wise fashion.
See Section 3.1 and Section 3.3
The above reference outlines the use of the Pearson correlation via Welford's centered sum-of-products along each diagonal of the distance matrix in place of the sliding window dot product found in the original STOMP method.
See Table II
Time series, T_B, will be annotated with the distance location (or index) of all its subsequences in another time series, T_A.
Return: For every subsequence, Q, in T_B, you will get a distance and index for the closest subsequence in T_A. Thus, the array returned will have length T_B.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.
Note: Unlike in Table II where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, our algorithm reduces down to the same algorithm found in Table II.
Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (see Definition 3 and Figure 3).
For self-joins, set ignore_trivial = True in order to avoid the trivial match.
Note that left and right matrix profiles are only available for self-joins.
stumped

stumpy.stumped(dask_client, T_A, m, T_B=None, ignore_trivial=True)[source]
Compute the matrix profile with parallelized and distributed STOMPopt with Pearson correlations.
This is a highly distributed implementation around the Numba JIT-compiled parallelized _stump function which computes the matrix profile according to STOMPopt with Pearson correlations.
Parameters:
 dask_client (client) – A Dask Distributed client that is connected to a Dask scheduler and Dask workers. Setting up a Dask distributed cluster is beyond the scope of this library. Please refer to the Dask Distributed documentation.
 T_A (ndarray) – The time series or sequence for which to compute the matrix profile
 m (int) – Window size
 T_B (ndarray) – The time series or sequence that contains your query subsequences of interest. Default is None which corresponds to a self-join.
 ignore_trivial (bool) – Set to True if this is a self-join. Otherwise, for an AB-join, set this to False. Default is True.
Returns: out – The first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.
Return type: ndarray
Notes
DOI: 10.1007/s10115-017-1138-x
See Section 4.5
The above reference outlines a general approach for traversing the distance matrix in a diagonal fashion rather than in a row-wise fashion.
See Section 3.1 and Section 3.3
The above reference outlines the use of the Pearson correlation via Welford's centered sum-of-products along each diagonal of the distance matrix in place of the sliding window dot product found in the original STOMP method.
See Table II
This is a Dask distributed implementation of stump that scales across multiple servers and is a convenience wrapper around the parallelized stump._stump function
Time series, T_B, will be annotated with the distance location (or index) of all its subsequences in another time series, T_A.
Return: For every subsequence, Q, in T_B, you will get a distance and index for the closest subsequence in T_A. Thus, the array returned will have length T_B.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.
Note: Unlike in Table II where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, our algorithm reduces down to the same algorithm found in Table II.
Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (see Definition 3 and Figure 3).
For self-joins, set ignore_trivial = True in order to avoid the trivial match.
Note that left and right matrix profiles are only available for self-joins.
gpu_stump

stumpy.gpu_stump(*args, **kwargs)
Compute the matrix profile with GPU-STOMP
This is a convenience wrapper around the Numba cuda.jit _gpu_stump function which computes the matrix profile according to GPU-STOMP.
Parameters:
 T_A (ndarray) – The time series or sequence for which to compute the matrix profile
 m (int) – Window size
 T_B ((optional) ndarray) – The time series or sequence that contains your query subsequences of interest. Default is None which corresponds to a self-join.
 ignore_trivial (bool) – Set to True if this is a self-join. Otherwise, for an AB-join, set this to False. Default is True.
 device_id (int or list) – The (GPU) device number to use. The default value is 0. A list of valid device ids (int) may also be provided for parallel GPU-STUMP computation. A list of all valid device ids can be obtained by executing [device.id for device in cuda.list_devices()].
Returns: out – The first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.
Return type: ndarray
Notes
See Table II, Figure 5, and Figure 6
Time series, T_B, will be annotated with the distance location (or index) of all its subsequences in another time series, T_A.
Return: For every subsequence, Q, in T_B, you will get a distance and index for the closest subsequence in T_A. Thus, the array returned will have length T_B.shape[0]-m+1. Additionally, the left and right matrix profiles are also returned.
Note: Unlike in Table II where T_A.shape is expected to be equal to T_B.shape, this implementation is generalized so that the shapes of T_A and T_B can be different. In the case where T_A.shape == T_B.shape, our algorithm reduces down to the same algorithm found in Table II.
Additionally, unlike STAMP where the exclusion zone is m/2, the default exclusion zone for STOMP is m/4 (see Definition 3 and Figure 3).
For self-joins, set ignore_trivial = True in order to avoid the trivial match.
Note that left and right matrix profiles are only available for self-joins.
scrump

stumpy.scrump(T_A, m, T_B=None, ignore_trivial=True, percentage=0.01, pre_scrump=False, s=None)[source]
Compute the approximate matrix profile with the parallelized SCRIMP algorithm. For SCRIMP++, set pre_scrump=True.
This is a convenience wrapper around the Numba JIT-compiled parallelized _stump function which computes the matrix profile according to SCRIMP.
Parameters:
 T_A (ndarray) – The time series or sequence for which to compute the matrix profile
 T_B (ndarray) – The time series or sequence that contains your query subsequences of interest
 m (int) – Window size
 ignore_trivial (bool) – Set to True if this is a self-join. Otherwise, for an AB-join, set this to False. Default is True.
 percentage (float) – Approximate percentage completed. The value is between 0.0 and 1.0.
 pre_scrump (bool) – A flag for whether or not to perform the PreSCRIMP calculation prior to computing SCRIMP. If set to True, this is equivalent to computing SCRIMP++
 s (int) – The size of the PreSCRIMP fixed interval. If pre_scrump=True and s=None, then s will automatically be set to s=int(np.ceil(m/4)), the size of the exclusion zone.

P_
The updated matrix profile
Type: ndarray

I_
The updated matrix profile indices
Type: ndarray

update()
Update the matrix profile and the matrix profile indices by computing additional new distances (limited by percentage) that make up the full distance matrix.
Notes
See Algorithm 1 and Algorithm 2
stumpi

stumpy.stumpi(T, m, excl_zone=None)[source]
Compute an incremental matrix profile for streaming data. This is based on the online STOMPI and STAMPI algorithms.
Parameters:
 T (ndarray) – The time series or sequence for which the matrix profile and matrix profile indices will be computed
 m (int) – Window size
 excl_zone (int) – The half width for the exclusion zone. Default is None.

P_
The updated matrix profile for T
Type: ndarray

I_
The updated matrix profile indices for T
Type: ndarray

left_P_
The updated left matrix profile for T
Type: ndarray

left_I_
The updated left matrix profile indices for T
Type: ndarray

T_
The updated time series or sequence for which the matrix profile and matrix profile indices are computed
Type: ndarray

update(t)
Append a single new data point, t, to the time series, T, and update the matrix profile
Notes
DOI: 10.1007/s10618-017-0519-9
See Table V
Note that line 11 is missing an important sqrt operation!

mstump

stumpy.mstump(T, m, include=None, discords=False)[source]
Compute the multi-dimensional matrix profile with parallelized mSTOMP
This is a convenience wrapper around the Numba JIT-compiled parallelized _mstump function which computes the multi-dimensional matrix profile and multi-dimensional matrix profile index according to mSTOMP, a variant of mSTAMP. Note that only self-joins are supported.
Parameters:
 T (ndarray) – The time series or sequence for which to compute the multi-dimensional matrix profile. Each row in T represents data from a different dimension while each column in T represents data from the same dimension.
 m (int) – Window size
 include (list, ndarray) – A list of (zero-based) indices corresponding to the dimensions in T that must be included in the constrained multi-dimensional motif search. For more information, see Section IV D of the mSTAMP paper.
 discords (bool) – When set to True, this reverses the distance matrix which results in a multi-dimensional matrix profile that favors larger matrix profile values (i.e., discords) rather than smaller values (i.e., motifs). Note that indices in include are still maintained and respected.
Returns:
 P (ndarray) – The multi-dimensional matrix profile. Each column of the array corresponds to each matrix profile for a given dimension (i.e., the first column is the 1-D matrix profile and the second column is the 2-D matrix profile).
 I (ndarray) – The multi-dimensional matrix profile index where each column of the array corresponds to each matrix profile index for a given dimension.
Notes
See mSTAMP Algorithm
mstumped

stumpy.mstumped(dask_client, T, m, include=None, discords=False)[source]
Compute the multi-dimensional matrix profile with parallelized and distributed mSTOMP
This is a highly distributed implementation around the Numba JIT-compiled parallelized _mstump function which computes the multi-dimensional matrix profile according to mSTOMP. Note that only self-joins are supported.
Parameters:
 dask_client (client) – A Dask Distributed client that is connected to a Dask scheduler and Dask workers. Setting up a Dask distributed cluster is beyond the scope of this library. Please refer to the Dask Distributed documentation.
 T (ndarray) – The time series or sequence for which to compute the multi-dimensional matrix profile. Each row in T represents data from a different dimension while each column in T represents data from the same dimension.
 m (int) – Window size
 include (list, ndarray) – A list of (zero-based) indices corresponding to the dimensions in T that must be included in the constrained multi-dimensional motif search. For more information, see Section IV D of the mSTAMP paper.
 discords (bool) – When set to True, this reverses the distance matrix which results in a multi-dimensional matrix profile that favors larger matrix profile values (i.e., discords) rather than smaller values (i.e., motifs). Note that indices in include are still maintained and respected.
Returns:
 P (ndarray) – The multi-dimensional matrix profile. Each column of the array corresponds to each matrix profile for a given dimension (i.e., the first column is the 1-D matrix profile and the second column is the 2-D matrix profile).
 I (ndarray) – The multi-dimensional matrix profile index where each column of the array corresponds to each matrix profile index for a given dimension.
Notes
See mSTAMP Algorithm
atsc

stumpy.atsc(IL, IR, j)[source]
Compute the anchored time series chain (ATSC).
Parameters:
 IL (ndarray) – Left matrix profile indices
 IR (ndarray) – Right matrix profile indices
 j (int) – The index value for which to compute the ATSC
Returns: output – Anchored time series chain for index, j
Return type: ndarray
Notes
See Table I
This is the implementation for the anchored time series chains (ATSC).
Unlike the original paper, we've replaced the while-loop with a more stable for-loop.
allc

stumpy.allc(IL, IR)[source]
Compute the all-chain set (ALLC).
Parameters:
 IL (ndarray) – Left matrix profile indices
 IR (ndarray) – Right matrix profile indices
Returns:
 S (list(ndarray)) – All-chain set
 C (ndarray) – Anchored time series chain for the longest chain (also known as the unanchored chain)
Notes
See Table II
Unlike the original paper, we've replaced the while-loop with a more stable for-loop.
This is the implementation for the all-chain set (ALLC) and the unanchored chain is simply the longest one among the all-chain set. Both the all-chain set and the unanchored chain are returned.
The all-chain set, S, is returned as a list of unique numpy arrays.
fluss

stumpy.fluss(I, L, n_regimes, excl_factor=5, custom_iac=None)[source]
Compute the Fast Low-cost Unipotent Semantic Segmentation (FLUSS) for static data.
Essentially, this is a wrapper to compute the corrected arc curve and regime locations.
Parameters:
 I (ndarray) – The matrix profile indices for the time series of interest
 L (int) – The subsequence length that is set roughly to be one period length. This is likely to be the same value as the window size, m, used to compute the matrix profile and matrix profile index but it can be different since this is only used to manage edge effects and has no bearing on any of the IAC or CAC core calculations.
 n_regimes (int) – The number of regimes to search for. This is one more than the number of regime changes as denoted in the original paper.
 excl_factor (int) – The multiplying factor for the regime exclusion zone
 custom_iac (np.array) – A custom idealized arc curve (IAC) that will be used for correcting the arc curve
Returns:
 cac (ndarray) – A corrected arc curve (CAC)
 regime_locs (ndarray) – The locations of the regimes
Notes
See Section A
This is the implementation for Fast Low-cost Unipotent Semantic Segmentation (FLUSS).
floss

stumpy.floss(mp, T, m, L, excl_factor=5, n_iter=1000, n_samples=1000, custom_iac=None)[source]
Compute the Fast Low-cost Online Semantic Segmentation (FLOSS) for streaming data.
Parameters:
 mp (ndarray) – The first column consists of the matrix profile, the second column consists of the matrix profile indices, the third column consists of the left matrix profile indices, and the fourth column consists of the right matrix profile indices.
 T (ndarray) – A 1-D time series data used to generate the matrix profile and matrix profile indices found in mp. Note that the right matrix profile index is used and the right matrix profile is intelligently recomputed on the fly from T instead of using the bidirectional matrix profile.
 m (int) – The window size for computing sliding window mass. This is identical to the window size used in the matrix profile calculation. For managing edge effects, see the L parameter.
 L (int) – The subsequence length that is set roughly to be one period length. This is likely to be the same value as the window size, m, used to compute the matrix profile and matrix profile index but it can be different since this is only used to manage edge effects and has no bearing on any of the IAC or CAC core calculations.
 excl_factor (int) – The multiplying factor for the regime exclusion zone. Note that this is unrelated to the excl_zone used to compute the matrix profile.
 n_iter (int) – Number of iterations to average over when determining the parameters for the IAC beta distribution
 n_samples (int) – Number of distribution samples to draw during each iteration when computing the IAC
 custom_iac (np.array) – A custom idealized arc curve (IAC) that will be used for correcting the arc curve

cac_1d_
A 1-dimensional corrected arc curve (CAC) updated as a result of ingressing a single new data point and egressing a single old data point.
Type: ndarray

P_
The matrix profile updated as a result of ingressing a single new data point and egressing a single old data point.
Type: ndarray

I_
The (right) matrix profile indices updated as a result of ingressing a single new data point and egressing a single old data point.
Type: ndarray

T_
The updated time series, T
Type: ndarray

update(t)
Ingress a new data point, t, onto the time series, T, followed by egressing the oldest single data point from T. Then, update the 1-dimensional corrected arc curve (CAC_1D) and the matrix profile.
Notes
DOI: 10.1109/ICDM.2017.21
See Section C
This is the implementation for Fast Low-cost Online Semantic Segmentation (FLOSS).
Tutorials
The Matrix Profile
Laying the Foundation
At its core, the STUMPY library efficiently computes something called a matrix profile, a vector that stores the z-normalized Euclidean distance between any subsequence within a time series and its nearest neighbor.
To fully understand what this means, let's take a step back and start with a simple illustrative example along with a few basic definitions:
Time Series with Length n = 13
[1]:
time_series = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
n = len(time_series)
To analyze this time series with length n = 13, we could visualize the data or calculate global summary statistics (i.e., mean, median, mode, min, max). If you had a much longer time series, then you may even feel compelled to build an ARIMA model, perform anomaly detection, or attempt a forecasting model but these methods can be complicated and may often have false positives or no interpretable insights.
However, if we were to apply Occam's Razor, then what is the most simple and intuitive approach that we could take to analyze this time series?
To answer this question, let's start with our first definition:
Subsequence /ˈsəbsəkwəns/ noun
a part or section of the full time series
So, the following are all considered subsequences of our time_series since they can all be found in the time series above.
[2]:
print(time_series[0:2])
print(time_series[4:7])
print(time_series[2:10])
[0, 1]
[9, 1, 14]
[3, 2, 9, 1, 14, 15, 1, 2]
We can see that each subsequence can have a different sequence length that we'll call m. So, for example, if we choose m = 4, then we can think about how we might compare any two subsequences of the same length.
[3]:
m = 4
i = 0 # starting index for the first subsequence
j = 8 # starting index for the second subsequence
subseq_1 = time_series[i:i+m]
subseq_2 = time_series[j:j+m]
print(subseq_1, subseq_2)
[0, 1, 3, 2] [1, 2, 2, 10]
One way to compare any two subsequences is to calculate what is called the Euclidean distance.
Euclidean Distance /yo͞oˈklidēən/ /ˈdistəns/ noun
the straight-line distance between two points
[4]:
import math
D = 0
for k in range(m):
D += (time_series[i+k] - time_series[j+k])**2
print(f"The square root of {D} = {math.sqrt(D)}")
The square root of 67 = 8.18535277187245
Distance Profile - Pairwise Euclidean Distances
Now, we can take this a step further where we keep one subsequence the same (the reference subsequence), change the second subsequence in a sliding window manner, and compute the Euclidean distance for each window. The resulting vector of pairwise Euclidean distances is also known as a distance profile.
Of course, not all of these distances are useful. Specifically, the distance for the self match (or trivial match) isn't informative since the distance will always be zero when you are comparing a subsequence with itself. So, we'll ignore it and, instead, take note of the next smallest distance from the distance profile and choose that as our best match:
Next, we can shift our reference subsequence over one element at a time and repeat the same sliding window process to compute the distance profile for each new reference subsequence.
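Writing the sliding window process out explicitly for our toy series (reusing the naive loop from before, with the reference subsequence anchored at index 0):

```python
import math

time_series = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
n = len(time_series)
m = 4
i = 0  # starting index of the (fixed) reference subsequence

# Slide the comparison subsequence across the series, recording one
# Euclidean distance per window
distance_profile = []
for j in range(n - m + 1):
    D = 0
    for k in range(m):
        D += (time_series[i + k] - time_series[j + k]) ** 2
    distance_profile.append(math.sqrt(D))

# The self match at j == 0 is trivially zero and gets ignored
print(distance_profile[0], len(distance_profile))  # 0.0 10
```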
Distance Matrix
If we take all of the distance profiles that were computed for each reference subsequence and stack them one on top of each other, then we get something called a distance matrix.
Now, we can simplify this distance matrix by only looking at the nearest neighbor for each subsequence and this takes us to our next concept:
Matrix Profile /ˈmātriks/ /ˈprōˌfīl/ noun
a vector that stores the (z-normalized) Euclidean distance between any subsequence within a time series and its nearest neighbor
Practically, what this means is that the matrix profile is only interested in storing the smallest non-trivial distances from each distance profile, which significantly reduces the spatial complexity to O(n):
We can now plot this matrix profile underneath our original time series. And, as it turns out, a reference subsequence with a small matrix profile value (i.e., it has a nearest neighbor significantly "close by") may indicate a possible pattern while a reference subsequence with a large matrix profile value (i.e., its nearest neighbor is significantly "far away") may suggest the presence of an anomaly.
So, by simply computing and inspecting the matrix profile alone, one can easily pick out the top pattern (global minimum) and rarest anomaly (global maximum). And this is only a small glimpse into what is possible once you've computed the matrix profile!
The Real Problem - The Brute Force Approach
Now, it might seem pretty straightforward at this point but what we need to do is consider how to compute the full distance matrix efficiently. Let's start with the brute force approach:
[5]:
for i in range(n-m+1):
    for j in range(n-m+1):
        D = 0
        for k in range(m):
            D += (time_series[i+k] - time_series[j+k])**2
        D = math.sqrt(D)
At first glance, this may not look too bad but if we start considering both the computational complexity as well as the spatial complexity then we begin to understand the real problem. It turns out that, for longer time series (i.e., n >> 10,000), the computational complexity is O(n²m) (as evidenced by the three for loops in the code above) and the spatial complexity for storing the full distance matrix is O(n²).
To put this into perspective, imagine if you had a single sensor that collected data 20 times/min over the course of 5 years. This would result in:
[6]:
n = 20 * 60 * 24 * 364 * 5 # 20 times/min x 60 mins/hour x 24 hours/day x 364 days/year x 5 years
print(f"There would be n = {n} data points")
There would be n = 52416000 data points
Assuming that each calculation in the inner loop takes 0.0000001 seconds then this would take:
[7]:
time = 0.0000001 * (n * n - n)/2
print(f"It would take {time} seconds to compute")
It would take 137371850.1792 seconds to compute
Which is equivalent to 1,589.9 days (or 4.4 years) and 11.1 PB of memory to compute! So, it is clearly not feasible to compute the distance matrix using our naive brute force method. Instead, we need to figure out how to reduce this computational complexity by efficiently generating a matrix profile and this is where STUMPY comes into play.
STUMPY
In the fall of 2016, researchers from the University of California, Riverside and the University of New Mexico published a beautiful set of back-to-back papers that described an exact method called STOMP for computing the matrix profile for any time series with a computational complexity of O(n²)! They also further demonstrated this using GPUs and they called this faster approach GPU-STOMP.
With academics, data scientists, and developers in mind, we have taken these concepts and open sourced STUMPY, a powerful and scalable library that efficiently computes the matrix profile according to this published research. And, thanks to other open source software such as Numba and Dask, our implementation is highly parallelized (for a single server with multiple CPUs or, alternatively, multiple GPUs) and highly distributed (with multiple CPUs across multiple servers). We've tested STUMPY on as many as 256 CPU cores (spread across 32 servers) or 16 NVIDIA GPU devices (on the same DGX-2 server) and have achieved performance similar to the published GPU-STOMP work.
Conclusion
According to the original authors, "these are the best ideas in time series data mining in the last two decades" and "given the matrix profile, most time series data mining problems are trivial to solve in a few lines of code".
From our experience, this is definitely true and we are excited to share STUMPY with you! Please reach out and let us know how STUMPY has enabled your time series analysis work as we'd love to hear from you!
Additional Notes
For the sake of completeness, we'll provide a few more comments for those of you who'd like to compare your own matrix profile implementation to STUMPY. However, due to the many details that are omitted in the original papers, we strongly encourage you to use STUMPY.
In our explanation above, we've only excluded the trivial match from consideration. However, this is insufficient since nearby subsequences (i.e., i ± 1) are likely highly similar and we need to expand this to a larger "exclusion zone" relative to the diagonal trivial match. Here, we can visualize what different exclusion zones look like:
However, in practice, it has been found that an exclusion zone of i ± int(np.ceil(m / 4)) works well (where m is the subsequence window size) and the distances computed in this region are set to np.inf before the matrix profile value is extracted for the i-th subsequence. Thus, the larger the window size is, the larger the exclusion zone will be. Additionally, note that, since NumPy indexing has an inclusive start index but an exclusive stop index, the proper way to ensure a symmetrical exclusion zone is:
excl_zone = int(np.ceil(m / 4))
zone_start = i - excl_zone
zone_end = i + excl_zone + 1 # Notice that we add one since this is exclusive
distance_profile[zone_start : zone_end] = np.inf
STUMPY Basics
Analyzing Motifs and Anomalies with STUMP
This example utilizes the main takeaways from the research papers: Matrix Profile I & Matrix Profile II.
To explore the basic concepts, we'll use the workhorse stump function to find interesting motifs (patterns) or discords (anomalies/novelties) and demonstrate these concepts with two different time series datasets:
 The Steamgen dataset
 The NYC taxi passengers dataset
stump is a Numba JIT-compiled version of the popular STOMP algorithm that is described in detail in the original Matrix Profile II paper. stump is capable of parallel computation and it performs an ordered search for patterns and outliers within a specified time series and takes advantage of the locality of some calculations to minimize the runtime.
Getting Started
Let's import the packages that we'll need to load, analyze, and plot the data.
[1]:
%matplotlib inline
import pandas as pd
import stumpy
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
from matplotlib.patches import Rectangle
import datetime as dt
import urllib
import ssl
import io
import os
The Python functions below are going to be used throughout this example to automatically resize and create the plots that are displayed using the Matplotlib plotting package.
[2]:
def change_plot_size(width, height, plt):
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = width
fig_size[1] = height
plt.rcParams["figure.figsize"] = fig_size
plt.rcParams['xtick.direction'] = 'out'
change_plot_size(20, 6, plt)
What is a Motif?
Time series motifs are approximately repeated subsequences found within a longer time series. Being able to say that a subsequence is "approximately repeated" requires that you be able to compare subsequences to each other. In the case of STUMPY, all subsequences within a time series can be compared by computing the pairwise z-normalized Euclidean distances and then storing only the distance to its nearest neighbor. This nearest neighbor distance vector is referred to as the matrix profile and the index to each nearest neighbor within the time series is referred to as the matrix profile index. Luckily, the stump function takes in any time series (with floating point values) and computes the matrix profile along with the matrix profile indices and, in turn, one can immediately find time series motifs. Let's look at an example:
Loading the Steamgen Dataset
This data was generated using fuzzy models applied to mimic a steam generator at the Abbott Power Plant in Champaign, IL. The data feature that we are interested in is the output steam flow telemetry that has units of kg/s and the data is "sampled" every three seconds with a total of 9,600 datapoints.
[3]:
colnames = ['drum pressure',
            'excess oxygen',
            'water level',
            'steam flow'
            ]
context = ssl.SSLContext() # Ignore SSL certificate verification for simplicity
url = 'https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat'
raw_bytes = urllib.request.urlopen(url, context=context).read()
data = io.BytesIO(raw_bytes)
steam_df = pd.read_csv(data, header=None, sep="\\s+")
steam_df.columns = colnames
steam_df.head()
[3]:
   drum pressure  excess oxygen  water level  steam flow
0      320.08239       2.506774     0.032701    9.302970
1      321.71099       2.545908     0.284799    9.662621
2      320.91331       2.360562     0.203652   10.990955
3      325.00252       0.027054     0.326187   12.430107
4      326.65276       0.285649     0.753776   13.681666
Visualizing the Steamgen Dataset
[4]:
plt.suptitle('Steamgen Dataset', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Steam Flow', fontsize='20')
plt.plot(steam_df['steam flow'].values)
[4]:
[<matplotlib.lines.Line2D at 0x10c367320>]
Take a moment and carefully examine the plot above with your naked eye. If you were told that there was a pattern that was approximately repeated, can you spot it? Even for a computer, this can be very challenging. Here's what you should be looking for:
Manually Finding a Motif
[5]:
m = 640
fig, axs = plt.subplots(2)
plt.suptitle('Steamgen Dataset', fontsize='30')
axs[0].set_ylabel("Steam Flow", fontsize='20')
axs[0].plot(steam_df['steam flow'], alpha=0.5, linewidth=1)
axs[0].plot(steam_df['steam flow'].iloc[643:643+m])
axs[0].plot(steam_df['steam flow'].iloc[8724:8724+m])
rect = Rectangle((643, 0), m, 40, facecolor='lightgrey')
axs[0].add_patch(rect)
rect = Rectangle((8724, 0), m, 40, facecolor='lightgrey')
axs[0].add_patch(rect)
axs[1].set_xlabel("Time", fontsize='20')
axs[1].set_ylabel("Steam Flow", fontsize='20')
axs[1].plot(steam_df['steam flow'].values[643:643+m], color='C1')
axs[1].plot(steam_df['steam flow'].values[8724:8724+m], color='C2')
[5]:
[<matplotlib.lines.Line2D at 0x1297e38d0>]
The motif (pattern) that we are looking for is highlighted above and yet it is still very hard to be certain that the orange and green subsequences are a match (upper panel), that is, until we zoom in on them and overlay the subsequences on top of each other (lower panel). Now, we can clearly see that the motif is very similar! The fundamental value of computing the matrix profile is that it not only allows you to quickly find motifs but it also identifies the nearest neighbor for all subsequences within your time series. Note that we haven't actually done anything special to locate the motif here except that we grabbed the locations from the original paper and plotted them. Now, let's take our steamgen data and apply the stump function to it:
Find a Motif Using STUMP
[6]:
m = 640
mp = stumpy.stump(steam_df['steam flow'], m)
stump requires two parameters:

1. A time series
2. A window size, m

In this case, based on some domain expertise, we've chosen m = 640, which is roughly equivalent to half-hour windows. And, again, the output of stump is an array that contains all of the matrix profile values (i.e., z-normalized Euclidean distance to your nearest neighbor) and matrix profile indices in the first and second columns, respectively (we'll ignore the third and fourth columns for now). Let's plot the matrix profile next to our raw data:
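To see how you would read this two-column output, the sketch below computes a brute-force matrix profile on a tiny synthetic series (a slow stand-in for stumpy.stump, which produces the same first two columns far faster; the series and planted motif locations are purely illustrative) and then recovers the motif pair from the first two columns:

```python
import numpy as np

def znorm(a):
    return (a - a.mean()) / a.std()

def naive_matrix_profile(T, m):
    """Brute-force matrix profile: one [distance, neighbor index] row per subsequence."""
    k = len(T) - m + 1
    out = np.full((k, 2), np.inf)
    for i in range(k):
        for j in range(k):
            if abs(i - j) < m:  # simple exclusion zone around the trivial self-match
                continue
            d = np.linalg.norm(znorm(T[i:i+m]) - znorm(T[j:j+m]))
            if d < out[i, 0]:
                out[i] = [d, j]
    return out

rng = np.random.default_rng(0)
T = rng.random(200)
pattern = np.sin(np.linspace(0, 2 * np.pi, 20))
T[30:50] += 4 * pattern    # plant the same pattern twice
T[140:160] += 4 * pattern

mp = naive_matrix_profile(T, 20)
motif_idx = int(np.argmin(mp[:, 0]))   # column 0: smallest distance = motif member
neighbor_idx = int(mp[motif_idx, 1])   # column 1: where its nearest neighbor lives
print(sorted([motif_idx, neighbor_idx]))  # the recovered pair sits at/near the planted spots
```

Reading the columns this way (mp[:, 0] for distances, mp[:, 1] for neighbor locations) is exactly how the plots below use the stump output.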
[7]:
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
plt.suptitle('Motif (Pattern) Discovery', fontsize='30')
axs[0].plot(steam_df['steam flow'].values)
axs[0].set_ylabel('Steam Flow', fontsize='20')
rect = Rectangle((643, 0), m, 40, facecolor='lightgrey')
axs[0].add_patch(rect)
rect = Rectangle((8724, 0), m, 40, facecolor='lightgrey')
axs[0].add_patch(rect)
axs[1].set_xlabel('Time', fontsize ='20')
axs[1].set_ylabel('Matrix Profile', fontsize='20')
axs[1].axvline(x=643, linestyle="dashed")
axs[1].axvline(x=8724, linestyle="dashed")
axs[1].plot(mp[:, 0])
[7]:
[<matplotlib.lines.Line2D at 0x12b5bcc50>]
What we learn is that the global minima (vertical dashed lines) from the matrix profile correspond to the locations of the two subsequences that make up the motif pair! And the exact z-normalized Euclidean distance between these two subsequences is:
[8]:
mp[:, 0].min()
[8]:
5.49161982776594
So, this distance isn't zero since we saw that the two subsequences aren't an identical match but, relative to the rest of the matrix profile (i.e., compared to either the mean or median of the matrix profile values), we can understand that this motif is a remarkably good match.
Find Anomalies using STUMP
Conversely, the maximum value in the matrix profile (computed from stump above) is:
[9]:
mp[:, 0].max()
[9]:
23.47616836730375
The matrix profile index also tells us which subsequence within the time series does not have a nearest neighbor that resembles itself:
[10]:
np.argwhere(mp[:, 0] == mp[:, 0].max()).flatten()[0]
[10]:
3864
The subsequence located at this global maximum is also referred to as a discord, novelty, or anomaly:
[11]:
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
plt.suptitle('Discord (Anomaly/Novelty) Discovery', fontsize='30')
axs[0].plot(steam_df['steam flow'].values)
axs[0].set_ylabel('Steam Flow', fontsize='20')
rect = Rectangle((3864, 0), m, 40, facecolor='lightgrey')
axs[0].add_patch(rect)
axs[1].set_xlabel('Time', fontsize ='20')
axs[1].set_ylabel('Matrix Profile', fontsize='20')
axs[1].axvline(x=3864, linestyle="dashed")
axs[1].plot(mp[:, 0])
[11]:
[<matplotlib.lines.Line2D at 0x12b62efd0>]
Now that you've mastered the STUMPY basics and understand how to discover motifs and anomalies from a time series, we'll leave it up to you to investigate other interesting local minima and local maxima in the steamgen dataset.
To further develop/reinforce our growing intuition, let's move on and explore another dataset!
Loading the NYC Taxi Passengers Dataset
First, we'll download historical data that represents the half-hourly average of the number of NYC taxi passengers over 75 days in the Fall of 2014.
We extract that data and insert it into a pandas dataframe, making sure the timestamps are stored as datetime objects and the values are of type float64. Note that we'll do a little more data cleaning than above just so you can see an example where the timestamp is included. But be aware that stump does not actually use or need the timestamp column at all when computing the matrix profile.
[12]:
taxi_df = pd.read_csv("https://raw.githubusercontent.com/stanfordfuturedata/ASAP/master/Taxi.csv", sep=',')
taxi_df['value'] = taxi_df['value'].astype(np.float64)
taxi_df['timestamp'] = pd.to_datetime(taxi_df['timestamp'])
taxi_df.head()
[12]:
            timestamp    value
0 2014-10-01 00:00:00  12751.0
1 2014-10-01 00:30:00   8767.0
2 2014-10-01 01:00:00   7005.0
3 2014-10-01 01:30:00   5257.0
4 2014-10-01 02:00:00   4189.0
Visualizing the Taxi Dataset
[13]:
# This code is going to be utilized to control the axis labeling of the plots
DAY_MULTIPLIER = 7  # Number of days between each labeled x-axis tick
x_axis_labels = taxi_df[(taxi_df.timestamp.dt.hour==0)]['timestamp'].dt.strftime('%b %d').values[::DAY_MULTIPLIER]
x_axis_labels[1::2] = " "
x_axis_labels, DAY_MULTIPLIER
plt.suptitle('Taxi Passenger Raw Data', fontsize='30')
plt.xlabel('Window Start Date', fontsize='20')
plt.ylabel('Half-Hourly Average\nNumber of Taxi Passengers', fontsize='20')
plt.plot(taxi_df['value'])
plt.xticks(np.arange(0, taxi_df['value'].shape[0], (48*DAY_MULTIPLIER)/2), x_axis_labels)
plt.xticks(rotation=75)
plt.minorticks_on()
plt.margins(x=0)
plt.show()
It seems as if there is a general periodicity between spans of 1 day and 7 days, which can likely be explained by the fact that more people use taxis throughout the day than through the night and that it is reasonable to say most weeks have similar taxi-rider patterns. Also, maybe there is an outlier just to the right of the window starting near the end of October but, other than that, there isn't anything you can conclude from just looking at the raw data.
Generating the Matrix Profile
Again, defining the window size, m, usually requires some level of domain knowledge, but we'll demonstrate later on that stump is robust to changes in this parameter. Since this data was taken half-hourly, we chose a value of m = 48 to represent the span of exactly one day:
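Since m here is simply the target time span divided by the sampling interval, a quick sanity check of that arithmetic with pandas (a hypothetical one-liner, not part of the tutorial's workflow) looks like:

```python
import pandas as pd

# One day's worth of half-hourly samples
m = int(pd.Timedelta('1D') / pd.Timedelta('30min'))
print(m)  # → 48
```

The same calculation gives the window sizes used in the varying-window experiment further below (e.g., seven days of half-hourly samples is 336).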
[14]:
m = 48
mp = stumpy.stump(taxi_df['value'], m=m)
Visualizing the Matrix Profile
[15]:
plt.suptitle('1-Day STUMP', fontsize='30')
plt.xlabel('Window Start', fontsize ='20')
plt.ylabel('Matrix Profile', fontsize='20')
plt.plot(mp[:, 0])
plt.plot(575, 1.7, marker="v", markersize=15, color='b')
plt.text(620, 1.6, 'Columbus Day', color="black", fontsize=20)
plt.plot(1535, 3.7, marker="v", markersize=15, color='b')
plt.text(1580, 3.6, 'Daylight Savings', color="black", fontsize=20)
plt.plot(2700, 3.1, marker="v", markersize=15, color='b')
plt.text(2745, 3.0, 'Thanksgiving', color="black", fontsize=20)
plt.plot(30, .2, marker="^", markersize=15, color='b', fillstyle='none')
plt.plot(363, .2, marker="^", markersize=15, color='b', fillstyle='none')
plt.xticks(np.arange(0, 3553, (m*DAY_MULTIPLIER)/2), x_axis_labels)
plt.xticks(rotation=75)
plt.minorticks_on()
plt.show()
Understanding the Matrix Profile
Let's understand what we're looking at.
Lowest Values
The lowest values (open triangles) are considered a motif since they represent the pair of nearest neighbor subsequences with the smallest z-normalized Euclidean distance. Interestingly, the two lowest data points are exactly 7 days apart, which suggests that, in this dataset, there may be a periodicity of seven days in addition to the more obvious periodicity of one day.
Highest Values
So what about the highest matrix profile values (filled triangles)? The subsequences that have the highest (local) values really emphasize their uniqueness. We found that the top three peaks happened to correspond exactly with the timing of Columbus Day, Daylight Saving Time, and Thanksgiving, respectively.
Different Window Sizes
As we mentioned above, stump should be robust to the choice of the window size parameter, m. Below, we demonstrate how manipulating the window size can have little impact on your resulting matrix profile by running stump with varying window sizes.
[16]:
days_dict = {
    "Half-Day": 24,
    "1-Day": 48,
    "2-Days": 96,
    "5-Days": 240,
    "7-Days": 336,
}
days_df = pd.DataFrame.from_dict(days_dict, orient='index', columns=['m'])
days_df.head()
[16]:
            m
Half-Day   24
1-Day      48
2-Days     96
5-Days    240
7-Days    336

We purposely chose spans of time that correspond to reasonably intuitive day-lengths that could be chosen by a human.
[17]:
fig, axs = plt.subplots(5, sharex=True, gridspec_kw={'hspace': 0})
fig.text(0.5, 0.1, 'Subsequence Start Date', ha='center', fontsize='20')
fig.text(0.08, 0.5, 'Matrix Profile', va='center', rotation='vertical', fontsize='20')
for i, varying_m in enumerate(days_df['m'].values):
    mp = stumpy.stump(taxi_df['value'], varying_m)
    axs[i].plot(mp[:, 0])
    axs[i].set_ylim(0, 9.5)
    axs[i].set_xlim(0, 3600)
    title = f"m = {varying_m}"
    axs[i].set_title(title, fontsize=20, y=.5)
plt.xticks(np.arange(0, taxi_df.shape[0], (48*DAY_MULTIPLIER)/2), x_axis_labels)
plt.xticks(rotation=75)
plt.suptitle('STUMP with Varying Window Sizes', fontsize='30')
plt.show()
We can see that even with varying window sizes, our peaks stay prominent. But it looks as if all the non-peak values are converging towards each other. This is why having knowledge of the data context is important prior to running stump, as it is helpful to choose a window size that can capture a repeating pattern or anomaly within the dataset.
GPU-STUMP - Faster STUMP Using GPUs
When you have significantly more than a few thousand data points in your time series, you may need a speed boost to help analyze your data. Luckily, you can try gpu_stump, a super fast GPU-powered alternative to stump that gives you the speed of a few hundred CPUs and provides the same output as stump:
import stumpy
mp = stumpy.gpu_stump(df['value'], m=m) # Note that you'll need a properly configured NVIDIA GPU for this
In fact, if you aren't dealing with PII/SII data, then you can try out gpu_stump using this notebook on Google Colab.
STUMPED - Distributed STUMP
Alternatively, if you only have access to a cluster of CPUs and your data needs to stay behind your firewall, then stump and gpu_stump may not be sufficient for your needs. Instead, you can try stumped, a distributed and parallel implementation of stump that depends on Dask distributed:
import stumpy
from dask.distributed import Client
dask_client = Client()
mp = stumpy.stumped(dask_client, df['value'], m=m) # Note that a dask client is needed
Summary
And that's it! You have now loaded a dataset, run it through stump using our package, and extracted conclusions about existing patterns and anomalies within two different time series. You can now import this package and use it in your own projects. Happy coding!
Time Series Chains
Forecasting Web Query Data with Anchored Time Series Chains (ATSC)
This example is adapted from the Web Query Volume case study and utilizes the main takeaways from the Matrix Profile VII research paper.
Getting Started
Let's import the packages that we'll need to load, analyze, and plot the data.
[1]:
%matplotlib inline
import pandas as pd
import numpy as np
import stumpy
from scipy.io import loadmat
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, FancyArrowPatch
import urllib
import ssl
import io
import itertools
What are Time Series Chains?
Time series chains may be informally considered as motifs that evolve or drift in some direction over time. The figure below illustrates the difference between time series motifs (left) and time series chains (right).
[2]:
def change_plot_size(width, height, plt):
    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = width
    fig_size[1] = height
    plt.rcParams["figure.figsize"] = fig_size
change_plot_size(20, 6, plt)
[3]:
x = np.random.rand(20)
y = np.random.rand(20)
n = 10
motifs_x = 0.5 * np.ones(n) + np.random.uniform(-0.05, 0.05, n)
motifs_y = 0.5 * np.ones(n) + np.random.uniform(-0.05, 0.05, n)
sin_x = np.linspace(0, np.pi/2, n+1)
sin_y = np.sin(sin_x)/4
chains_x = 0.5 * np.ones(n+1) + 0.02 * np.arange(n+1)
chains_y = 0.5 * np.ones(n+1) + sin_y
fig, axes = plt.subplots(nrows=1, ncols=2)
axes[0].scatter(x, y, color='lightgrey')
axes[0].scatter(motifs_x, motifs_y, color='red')
axes[1].scatter(x, y, color='lightgrey')
axes[1].scatter(chains_x[0], chains_y[0], edgecolor='red', color='white')
axes[1].scatter(chains_x[1:n], chains_y[1:n], color='red')
axes[1].scatter(chains_x[n], chains_y[n], edgecolor='red', color='white', marker='*', s=200)
[3]:
<matplotlib.collections.PathCollection at 0x11eaf9278>
Above, we are visualizing time series subsequences as points in high-dimensional space. Shown on the left is a time series motif and it can be thought of as a collection of points that approximate a platonic ideal. In contrast, depicted on the right, is a time series chain and it may be thought of as an evolving trail of points in the space. Here, the open red circle represents the first link in the chain, the anchor. Both motifs and chains have the property that each subsequence is relatively close to its nearest neighbor. However, the motif set (left) also has a relatively small diameter. In contrast, the set of points in a chain (right) has a diameter that is much larger than the mean of each member's distance to its nearest neighbor and, moreover, the chain has the important property of directionality. For example, in the case of a motif, if an additional member were added to the motif set, its location would also be somewhere near the platonic ideal, but independent of the previous subsequences. In contrast, in the case of a chain, the location of the next member of the chain would be somewhere after the last red circle, possibly where the open red star is located.
A Simplified Example
Adapted from the Matrix Profile VII paper, consider the following time series:

47, 32, 1, 22, 2, 58, 3, 36, 4, -5, 5, 40

Assume that the subsequence length is 1 and the distance between two subsequences is simply the absolute difference between them. To be clear, we are making these simple and pathological assumptions here just for the purposes of elucidation; we are actually targeting much longer subsequence lengths and using z-normalized Euclidean distance in our applications. To capture the directionality of a time series chain, we need to store the left and right nearest neighbor information in the left (IL) and right (IR) matrix profile indices:
Index  Value  Left Index (IL)  Right Index (IR)
1      47     -                12
2      32     1                8
3      1      2                5
4      22     2                8
5      2      3                7
6      58     1                12
7      3      5                9
8      36     2                12
9      4      7                11
10     -5     3                11
11     5      9                12
12     40     8                -
In this vertical/transposed representation, the index column shows the location of every subsequence in the time series, the value column contains the original numbers from our time series above, the IL column shows the left matrix profile indices, and the IR column shows the right matrix profile indices. For example, IR[2] = 8 means the right nearest neighbor of index = 2 (which has value = 32) is at index = 8 (which has value = 36). Similarly, IL[3] = 2 means that the left nearest neighbor of index = 3 (with value = 1) is at index = 2 (which has value = 32). To better visualize the left/right matrix profile index, we use arrows to link every subsequence in the time series with its left and right nearest neighbors:
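These left and right indices are easy to verify by brute force. The helper below is a hypothetical sketch that uses the same pathological assumptions as the example (subsequence length 1 and absolute difference as the distance) and reproduces the table above with 1-based indices:

```python
import numpy as np

T = np.array([47, 32, 1, 22, 2, 58, 3, 36, 4, -5, 5, 40])

def left_right_indices(T):
    """Nearest neighbor to the left (IL) and right (IR) of each point,
    using |a - b| as the distance (subsequence length m = 1)."""
    n = len(T)
    IL, IR = [None] * n, [None] * n
    for i in range(n):
        if i > 0:
            # closest value among everything to the left, reported 1-based
            IL[i] = int(np.argmin(np.abs(T[:i] - T[i]))) + 1
        if i < n - 1:
            # closest value among everything to the right, reported 1-based
            IR[i] = int(np.argmin(np.abs(T[i+1:] - T[i]))) + i + 2
    return IL, IR

IL, IR = left_right_indices(T)
print(IR[1])  # right neighbor of index 2 (value 32) → 8 (value 36)
print(IL[2])  # left neighbor of index 3 (value 1)  → 2 (value 32)
```

Running this reproduces every IL/IR entry in the table, including the boundary cases (index 1 has no left neighbor and index 12 has no right neighbor).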
[4]:
nearest_neighbors = np.array([[1, 47, np.nan, 12],
                              [2, 32, 1, 8],
                              [3, 1, 2, 5],
                              [4, 22, 2, 8],
                              [5, 2, 3, 7],
                              [6, 58, 1, 12],
                              [7, 3, 5, 9],
                              [8, 36, 2, 12],
                              [9, 4, 7, 11],
                              [10, -5, 3, 11],
                              [11, 5, 9, 12],
                              [12, 40, 8, np.nan]])
colors = [['C1', 'C1'],
['C2', 'C5'],
['C3', 'C5'],
['C4', 'C4'],
['C3', 'C2'],
['C5', 'C3'],
['C3', 'C2'],
['C2', 'C1'],
['C3', 'C2'],
['C6', 'C1'],
['C6', 'C2'],
['C1', 'C1']]
style="Simple, tail_width=0.5, head_width=6, head_length=8"
kw = dict(arrowstyle=style, connectionstyle="arc3, rad=.5",)
xs = np.arange(nearest_neighbors.shape[0]) + 1
ys = np.zeros(nearest_neighbors.shape[0])
plt.plot(xs, ys, "o", markerfacecolor="None", markeredgecolor="None", linestyle="None")
x0, x1, y0, y1 = plt.axis()
plot_margin = 5.0
plt.axis((x0 - plot_margin,
          x1 + plot_margin,
          y0 - plot_margin,
          y1 + plot_margin))
plt.axis('off')
for x, y, nearest_neighbor, color in zip(xs, ys, nearest_neighbors, colors):
    plt.text(x, y, str(int(nearest_neighbor[1])), color="black", fontsize=20)
    # Plot right matrix profile indices
    if not np.isnan(nearest_neighbor[3]):
        arrow = FancyArrowPatch((x, 0.5), (nearest_neighbor[3], 0.5), color=color[0], **kw)
        plt.gca().add_patch(arrow)
    # Plot left matrix profile indices
    if not np.isnan(nearest_neighbor[2]):
        arrow = FancyArrowPatch((x, 0.0), (nearest_neighbor[2], 0.0), color=color[1], **kw)
        plt.gca().add_patch(arrow)
An arrow pointing from a number to its right nearest neighbor (arrows shown above the time series) can be referred to as a forward arrow and an arrow pointing from a number to its left nearest neighbor (arrows shown below the time series) can be referred to as a backward arrow. According to the formal definition of a time series chain (see Matrix Profile VII for a thorough definition and discussion), every pair of consecutive subsequences in a chain must be connected by both a forward arrow and a backward arrow. A keen eye will spot the fact that the longest chain in our simplified example is:
[5]:
nearest_neighbors = np.array([[1, 47, np.nan, np.nan],
                              [2, 32, np.nan, np.nan],
                              [3, 1, np.nan, 5],
                              [4, 22, np.nan, np.nan],
                              [5, 2, 3, 7],
                              [6, 58, np.nan, np.nan],
                              [7, 3, 5, 9],
                              [8, 36, np.nan, np.nan],
                              [9, 4, 7, 11],
                              [10, -5, np.nan, np.nan],
                              [11, 5, 9, np.nan],
                              [12, 40, np.nan, np.nan]])
colors = [['C1', 'C1'],
['C2', 'C5'],
['C3', 'C5'],
['C4', 'C4'],
['C3', 'C2'],
['C5', 'C3'],
['C3', 'C2'],
['C2', 'C1'],
['C3', 'C2'],
['C6', 'C1'],
['C6', 'C2'],
['C1', 'C1']]
style="Simple, tail_width=0.5, head_width=6, head_length=8"
kw = dict(arrowstyle=style, connectionstyle="arc3, rad=.5",)
xs = np.arange(nearest_neighbors.shape[0]) + 1
ys = np.zeros(nearest_neighbors.shape[0])
plt.plot(xs, ys, "o", markerfacecolor="None", markeredgecolor="None", linestyle="None")
x0, x1, y0, y1 = plt.axis()
plot_margin = 5.0
plt.axis((x0 - plot_margin,
          x1 + plot_margin,
          y0 - plot_margin,
          y1 + plot_margin))
plt.axis('off')
for x, y, nearest_neighbor, color in zip(xs, ys, nearest_neighbors, colors):
    plt.text(x, y, str(int(nearest_neighbor[1])), color="black", fontsize=20)
    # Plot right matrix profile indices
    if not np.isnan(nearest_neighbor[3]):
        arrow = FancyArrowPatch((x, 0.5), (nearest_neighbor[3], 0.5), color=color[0], **kw)
        plt.gca().add_patch(arrow)
    # Plot left matrix profile indices
    if not np.isnan(nearest_neighbor[2]):
        arrow = FancyArrowPatch((x, 0.0), (nearest_neighbor[2], 0.0), color=color[1], **kw)
        plt.gca().add_patch(arrow)
The longest extracted chain is therefore 1 ⇌ 2 ⇌ 3 ⇌ 4 ⇌ 5. Note that we see a gradual monotonic increase in the data but, in reality, the increase or decrease in drift can happen in arbitrarily complex ways that can be detected by the time series chains approach. The key component of drifting is that the time series must contain chains with clear directionality.
STUMPY is capable of computing:

- anchored time series chains (ATSC) - grow a chain from a user-specified anchor (i.e., a specific subsequence)
- all-chain set (ALLC) - a set of anchored time series chains (i.e., each chain starts with a particular subsequence) that are not subsumed by another longer chain
- unanchored time series chain(s) - the unconditionally longest chain within a time series (there could be more than one if there were chains of the same length)

So, what does this mean in the context of a real time series? Let's take a look at a real example from web query data!
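Before moving to the real data, the forward-and-backward-arrow rule can be made concrete with a hypothetical pure-Python sketch that grows an anchored chain from the toy example's left/right indices; the real stumpy.atsc applies the same idea to actual matrix profile indices:

```python
import numpy as np

# 1-based left/right matrix profile indices from the simplified example
# (0 marks "no neighbor" so the arrays can stay integer-typed)
IL = np.array([0, 1, 2, 2, 3, 1, 5, 2, 7, 3, 9, 8])
IR = np.array([12, 8, 5, 8, 7, 12, 9, 12, 11, 11, 12, 0])

def atsc(IL, IR, anchor):
    """Grow an anchored chain: follow IR forward, and keep the link only if
    the target's IL points straight back (a forward AND a backward arrow)."""
    chain = [anchor]
    j = anchor
    while True:
        k = IR[j - 1]                  # forward arrow (1-based indexing)
        if k == 0 or IL[k - 1] != j:   # no matching backward arrow: chain ends
            break
        chain.append(k)
        j = k
    return chain

print(atsc(IL, IR, 3))  # → [3, 5, 7, 9, 11], i.e., the values 1, 2, 3, 4, 5
```

Anchoring at index 3 recovers exactly the chain highlighted above; anchoring at most other indices produces only a trivial one-link chain, which is why the all-chain set and the unanchored (longest) chain are the interesting outputs.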
Retrieve the Data
We will be looking at a noisy dataset that is undersampled and has a growing trend, which will perfectly illustrate the idea behind time series chains. The data contains a decade-long Google Trends query volume (collected weekly from 2004-2014) for the keyword Kohl's, an American retail chain. First, we'll download the data, extract it, and insert it into a pandas dataframe.
[6]:
context = ssl.SSLContext() # Ignore SSL certificate verification for simplicity
url = 'https://sites.google.com/site/timeserieschain/home/Kohls_data.mat?attredirects=0&revision=1'
raw_bytes = urllib.request.urlopen(url, context=context).read()
data = io.BytesIO(raw_bytes)
mat = loadmat(data)
mdata = mat['VarName1']
mdtype = mdata.dtype
df = pd.DataFrame(mdata, dtype=mdtype, columns=['volume'])
df.head()
[6]:
volume  

0  0.010417 
1  0.010417 
2  0.010417 
3  0.000000 
4  0.000000 
Visualizing the Data
[7]:
plt.plot(df['volume'], color='black')
plt.xlim(0, df.shape[0]+12)
color = itertools.cycle(['white', 'gainsboro'])
for i, x in enumerate(range(0, df.shape[0], 52)):
    plt.text(x+12, 0.9, str(2004+i), color="black", fontsize=20)
    rect = Rectangle((x, -1), 52, 2.5, facecolor=next(color))
    plt.gca().add_patch(rect)
The raw time series above displays ten years of web query volume for the keyword "Kohl's", where each alternating white and grey vertical band represents a 52-week period starting from 2004 to 2014. As depicted, the time series features a significant but unsurprising "end-of-year holiday bump". Relating back to time series chains, we can see that the bump is generally increasing over time and so we might be able to capture this when we compute the unanchored chain.
However, as we learned above, in order to compute any time series chains, we also need the left and right matrix profile indices. Luckily for us, according to the docstring, the stump function not only returns the (bidirectional) matrix profile and the matrix profile indices in the first and second columns of the NumPy array, respectively, but the third and fourth columns consist of the left matrix profile indices and the right matrix profile indices, respectively:
[8]:
?stumpy.stump
Signature: stumpy.stump(T_A, m, T_B=None, ignore_trivial=True)
Docstring:
Compute the matrix profile with parallelized STOMP

This is a convenience wrapper around the Numba JIT-compiled parallelized
`_stump` function which computes the matrix profile according to STOMP.

Parameters
----------
T_A : ndarray
    The time series or sequence for which to compute the matrix profile
m : int
    Window size
T_B : ndarray
    The time series or sequence that contain your query subsequences
    of interest. Default is `None` which corresponds to a self-join.
ignore_trivial : bool
    Set to `True` if this is a self-join. Otherwise, for AB-join, set this
    to `False`. Default is `True`.

Returns
-------
out : ndarray
    The first column consists of the matrix profile, the second column
    consists of the matrix profile indices, the third column consists of
    the left matrix profile indices, and the fourth column consists of
    the right matrix profile indices.

Notes
-----
DOI: 10.1109/ICDM.2016.0085
See Table II

Time series, T_B, will be annotated with the distance location
(or index) of all its subsequences in another time series, T_A.

Return: For every subsequence, Q, in T_B, you will get a distance
and index for the closest subsequence in T_A. Thus, the array
returned will have length T_B.shape[0]-m+1. Additionally, the
left and right matrix profiles are also returned.

Note: Unlike in Table II where T_A.shape is expected to be equal
to T_B.shape, this implementation is generalized so that the shapes of
T_A and T_B can be different. In the case where T_A.shape == T_B.shape,
then our algorithm reduces down to the same algorithm found in Table II.

Additionally, unlike STAMP where the exclusion zone is m/2, the default
exclusion zone for STOMP is m/4 (See Definition 3 and Figure 3).

For self-joins, set `ignore_trivial = True` in order to avoid the
trivial match.

Note that left and right matrix profiles are only available for self-joins.
File: ~/miniconda3/lib/python3.7/site-packages/stumpy-1.1.0-py3.7.egg/stumpy/stump.py
Type: function
Computing the Left and Right Matrix Profile Indices
So, let's go ahead and compute the matrix profile indices, and we'll set the window size, m = 20, which is the approximate length of a "bump".
[9]:
m = 20
mp = stumpy.stump(df['volume'], m=m)
Computing the Unanchored Chain
Now, with our left and right matrix profile indices in hand, we are ready to call the all-chain set function, allc, which not only returns the all-chain set but, as a freebie, also returns the unconditionally longest chain, also known as the unanchored chain. The latter is really what we're most interested in.
[10]:
all_chain_set, unanchored_chain = stumpy.allc(mp[:, 2], mp[:, 3])
Visualizing the Unanchored Chain
[11]:
plt.plot(df['volume'], linewidth=1, color='black')
for i in range(unanchored_chain.shape[0]):
    y = df['volume'].iloc[unanchored_chain[i]:unanchored_chain[i]+m]
    x = y.index.values
    plt.plot(x, y, linewidth=3)
color = itertools.cycle(['white', 'gainsboro'])
for i, x in enumerate(range(0, df.shape[0], 52)):
    plt.text(x+12, 0.9, str(2004+i), color="black", fontsize=20)
    rect = Rectangle((x, -1), 52, 2.5, facecolor=next(color))
    plt.gca().add_patch(rect)
[12]:
plt.axis('off')
for i in range(unanchored_chain.shape[0]):
    data = df['volume'].iloc[unanchored_chain[i]:unanchored_chain[i]+m].reset_index().values
    x = data[:, 0]
    y = data[:, 1]
    plt.axvline(x=x[0]-x.min()+(m+5)*i + 11, alpha=0.3)
    plt.axvline(x=x[0]-x.min()+(m+5)*i + 15, alpha=0.3, linestyle='-.')
    plt.plot(x-x.min()+(m+5)*i, y-y.min(), linewidth=3)
The discovered chain shows that over the decade, the bump transitions from a smooth bump covering the period between Thanksgiving (solid vertical line) and Christmas (dashed vertical line), to a more sharply focused bump centered on Thanksgiving. This seems to reflect the growing importance of "Cyber Monday", a marketing term for the Monday after Thanksgiving. The phrase was created by marketing companies to persuade consumers to shop online. The term made its debut on November 28th, 2005 in a press release entitled "Cyber Monday Quickly Becoming One of the Biggest Online Shopping Days of the Year". Note that this date coincides with the first glimpse of the sharpening peak in our chain.
It also appears that we may have "missed" a few links in the chain. However, note that the data is noisy and undersampled, and the "missed" bumps are too distorted to conform with the general evolving trend. This noisy example actually illustrates the robustness of the time series chains technique. As noted before, we don't actually need "perfect" data in order to find meaningful chains. Even if some links are badly distorted, the discovered chain will still be able to include all of the other evolving patterns.
One final consideration is the potential use of chains to predict the future. One could leverage the evolving links within the chains in order to forecast the shape of the next bump. We refer the reader to the Matrix Profile VII paper for further discussions on this topic.
Summary
And that's it! You've just learned the basics of how to identify directional trends, also known as chains, within your data using the matrix profile indices and leveraging allc.
Semantic Segmentation
Analyzing Arterial Blood Pressure Data with FLUSS and FLOSS
This example utilizes the main takeaways from the Matrix Profile VIII research paper. For proper context, we highly recommend that you read the paper first but know that our implementations follow this paper closely.
According to the aforementioned publication, "one of the most basic analyses one can perform on [increasing amounts of time series data being captured] is to segment it into homogeneous regions." In other words, wouldn't it be nice if you could take your long time series data and segment or chop it up into k regions (where k is small), with the ultimate goal of presenting only k short representative patterns to a human (or machine) annotator in order to produce labels for the entire dataset? These segmented regions are also known as "regimes". Additionally, as an exploratory tool, one might uncover new actionable insights in the data that were previously undiscovered. Fast low-cost unipotent semantic segmentation (FLUSS) is an algorithm that produces something called an "arc curve", which annotates the raw time series with information about the likelihood of a regime change. Fast low-cost online semantic segmentation (FLOSS) is a variation of FLUSS that, according to the original paper, is domain agnostic, offers streaming capabilities with potential for actionable real-time intervention, and is suitable for real world data (i.e., it does not assume that every region of the data belongs to a well-defined semantic segment).
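The arc-counting idea at the heart of FLUSS can be sketched by hand: draw an "arc" from every subsequence to its nearest neighbor (using the matrix profile indices) and count how many arcs cross each location; few crossings suggest a regime boundary. The helper below is a simplified, hypothetical illustration only (the real stumpy.fluss also normalizes these counts against an idealized arc curve and applies an exclusion zone at the edges):

```python
import numpy as np

def arc_counts(I):
    """For each position i, count nearest-neighbor arcs (j -> I[j]) crossing i."""
    n = len(I)
    counts = np.zeros(n, dtype=int)
    for j, nn in enumerate(I):
        lo, hi = min(j, nn), max(j, nn)
        counts[lo:hi] += 1  # the arc covers the gaps in [lo, hi)
    return counts

# Toy matrix profile indices with two well-separated "regimes": positions
# 0-4 all point within the left half and 5-9 within the right half, so no
# arc crosses the gap between positions 4 and 5
I = np.array([2, 3, 0, 1, 2, 7, 8, 5, 6, 7])
print(arc_counts(I))  # → [2 4 3 1 0 2 4 3 1 0]; the dip at position 4 marks the boundary
```

Note the count also drops to zero at the very end of the array simply because no arcs can extend past the edge, which is exactly why the real implementation down-weights or excludes the edges.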
To demonstrate the API and underlying principles, we will be looking at arterial blood pressure (ABP) data from a healthy volunteer resting on a medical tilt table and will be seeing if we can detect when the table is tilted from a horizontal position to a vertical position. This is the same data that is presented throughout the original paper (above).
Getting Started
Let's import the packages that we'll need to load, analyze, and plot the data.
[1]:
%matplotlib inline
import pandas as pd
import numpy as np
import stumpy
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, FancyArrowPatch
from matplotlib import animation
from IPython.display import HTML
import urllib
import ssl
import io
import os
Retrieve the Data
[2]:
context = ssl.SSLContext() # Ignore SSL certificate verification for simplicity
url = 'https://sites.google.com/site/timeserieschain/home/TiltABP_210_25000.txt'
raw_bytes = urllib.request.urlopen(url, context=context).read()
data = io.BytesIO(raw_bytes)
df = pd.read_csv(data, header=None)
df = df.reset_index().rename({'index': 'time', 0: 'abp'}, axis='columns')
df.head()
[2]:
   time     abp
0     0  6832.0
1     1  6928.0
2     2  6968.0
3     3  6992.0
4     4  6980.0
Visualizing the Raw Data¶
[3]:
def change_plot_size(width, height, plt):
    fig_size = plt.rcParams["figure.figsize"]
    fig_size[0] = width
    fig_size[1] = height
    plt.rcParams["figure.figsize"] = fig_size

change_plot_size(20, 6, plt)
[4]:
plt.plot(df['time'], df['abp'])
rect = Rectangle((24000,2400),2000,6000,facecolor='lightgrey')
plt.gca().add_patch(rect)
[4]:
<matplotlib.patches.Rectangle at 0x1a27cd0310>
We can clearly see that there is a change around time=25000
that corresponds to when the table was tilted upright.
FLUSS¶
Instead of using the full dataset, let's zoom in and analyze the 2,500 data points directly before and after x=25000
(see Figure 5 in the paper).
[5]:
start = 25000 - 2500
stop = 25000 + 2500
abp = df.iloc[start:stop, 1]
plt.plot(range(abp.shape[0]), abp)
plt.ylim(2800, 8500)
plt.axvline(x=2373, linestyle="dashed")
style = "Simple, tail_width=0.5, head_width=6, head_length=8"
kw = dict(arrowstyle=style, color="k")
# regime 1
rect = Rectangle((55,2500), 225, 6000, facecolor='lightgrey')
plt.gca().add_patch(rect)
rect = Rectangle((470,2500), 225, 6000, facecolor='lightgrey')
plt.gca().add_patch(rect)
rect = Rectangle((880,2500), 225, 6000, facecolor='lightgrey')
plt.gca().add_patch(rect)
rect = Rectangle((1700,2500), 225, 6000, facecolor='lightgrey')
plt.gca().add_patch(rect)
arrow = FancyArrowPatch((75, 7000), (490, 7000), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
arrow = FancyArrowPatch((495, 7000), (905, 7000), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
arrow = FancyArrowPatch((905, 7000), (495, 7000), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
arrow = FancyArrowPatch((1735, 7100), (490, 7100), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
# regime 2
rect = Rectangle((2510,2500), 225, 6000, facecolor='moccasin')
plt.gca().add_patch(rect)
rect = Rectangle((2910,2500), 225, 6000, facecolor='moccasin')
plt.gca().add_patch(rect)
rect = Rectangle((3310,2500), 225, 6000, facecolor='moccasin')
plt.gca().add_patch(rect)
arrow = FancyArrowPatch((2540, 7000), (3340, 7000), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
arrow = FancyArrowPatch((2960, 7000), (2540, 7000), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
arrow = FancyArrowPatch((3340, 7100), (3540, 7100), connectionstyle="arc3, rad=.5", **kw)
plt.gca().add_patch(arrow)
[5]:
<matplotlib.patches.FancyArrowPatch at 0x1a282c9c90>
Roughly, in the truncated plot above, we see that the segmentation between the two regimes occurs around time=2373
(vertical dotted line), where the patterns from the first regime (grey) don't cross over into the second regime (orange) (see Figure 2 in the original paper). And so the "arc curve" is calculated by sliding along the time series and simply counting the number of times other patterns have "crossed over" that specific time point (i.e., "arcs"). Essentially, this information can be
extracted by looking at the matrix profile indices (which tell you where along the time series your nearest neighbor is). And so, we'd expect the arc counts to be high where repeated patterns are near each other and low where there are no crossing arcs.
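To make the arc-counting idea concrete, here is a naive, illustrative sketch (not STUMPY's actual implementation) that counts, for a toy array of nearest-neighbor indices, how many arcs cross each time point:

```python
import numpy as np

def naive_arc_counts(indices):
    """Count, for each position, how many nearest-neighbor arcs cross it.

    Each position i draws an arc to its nearest neighbor indices[i];
    every position strictly between the two endpoints is "crossed".
    """
    counts = np.zeros(len(indices))
    for i, j in enumerate(indices):
        lo, hi = sorted((i, int(j)))
        counts[lo:hi] += 1  # the arc from i to j covers positions lo..hi-1
    return counts

# Toy nearest-neighbor indices for a series of length 5
arcs = naive_arc_counts(np.array([4, 3, 3, 1, 0]))
```

A regime change shows up as a valley in these counts: very few arcs cross the boundary between two regimes, because patterns on each side find their nearest neighbors on their own side.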
Before we compute the "arc curve", we'll first need to compute the standard matrix profile, and we know that the window size is approximately 210 data points (thanks to the knowledge of the subject matter/domain expert).
[6]:
m = 210
mp = stumpy.stump(abp, m=m)
Now, to compute the "arc curve" and determine the location of the regime change, we can directly call the fluss function. However, note that fluss requires the following inputs:

1. the matrix profile indices, mp[:, 1] (not the matrix profile distances)
2. an appropriate subsequence length, L (for convenience, we'll just choose it to be equal to the window size, m=210)
3. the number of regimes, n_regimes, to search for (2 regimes in this case)
4. an exclusion factor, excl_factor, to nullify the beginning and end of the arc curve (anywhere between 1-5 is reasonable according to the paper)
[7]:
L = 210
cac, regime_locations = stumpy.fluss(mp[:, 1], L=L, n_regimes=2, excl_factor=1)
Notice that fluss actually returns something called the "corrected arc curve" (CAC), which accounts for the fact that there are typically fewer arcs crossing over a time point near the beginning and end of the time series and more potential for crossovers near the middle of the time series. Additionally, fluss returns the regimes, i.e., the location(s) of the dotted line(s). Let's plot our original time series (top) along with the corrected arc curve (orange) and the single regime (vertical dotted line).
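The correction can be understood as dividing the raw arc counts by an idealized arc curve: under a null model where arcs connect random positions, the expected number of crossings follows a parabola peaking mid-series. Here is a minimal sketch of that idea (an illustration of the principle, not STUMPY's exact fluss internals):

```python
import numpy as np

def corrected_arc_curve(arc_counts):
    """Normalize raw arc counts by an idealized (parabolic) arc curve."""
    n = len(arc_counts)
    i = np.arange(n)
    iac = 2.0 * i * (n - i) / n  # expected crossings if arcs were random
    iac[iac == 0.0] = 1.0        # avoid division by zero at the edges
    return np.minimum(arc_counts / iac, 1.0)  # clip into [0, 1]

# Low values mark likely regime boundaries; the edges stay at 0 by construction
cac_demo = corrected_arc_curve(np.array([0.0, 5.0, 5.0, 5.0, 0.0]))
```

Without this normalization, the raw arc counts would always dip near the series boundaries, producing false regime changes there.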
[8]:
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
axs[0].plot(range(abp.shape[0]), abp)
axs[0].axvline(x=regime_locations[0], linestyle="dashed")
axs[1].plot(range(cac.shape[0]), cac, color='C1')
axs[1].axvline(x=regime_locations[0], linestyle="dashed")
[8]:
<matplotlib.lines.Line2D at 0x1a2a9dc5d0>
FLOSS¶
Unlike FLUSS, FLOSS is concerned with streaming data, and so it calculates a modified version of the corrected arc curve (CAC) that is strictly one-directional (CAC_1D) rather than bidirectional. That is, instead of expecting crossovers to be equally likely from both directions, we expect more crossovers to point toward the future (and fewer to point toward the past). So, we can manually compute the CAC_1D:
[9]:
cac_1d = stumpy._cac(mp[:, 3], L, bidirectional=False, excl_factor=1) # This is for demo purposes only. Use floss() below!
and compare the CAC_1D
(blue) with the bidirectional CAC
(orange), and we see that the global minima are approximately in the same place (see Figure 10 in the original paper).
[10]:
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
axs[0].plot(np.arange(abp.shape[0]), abp)
axs[0].axvline(x=regime_locations[0], linestyle="dashed")
axs[1].plot(range(cac.shape[0]), cac, color='C1')
axs[1].axvline(x=regime_locations[0], linestyle="dashed")
axs[1].plot(range(cac_1d.shape[0]), cac_1d)
[10]:
[<matplotlib.lines.Line2D at 0x1a2a4c34d0>]
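For intuition, the one-directional variant can be sketched by keeping only arcs that point toward the future. This is a naive illustration of the concept, not STUMPY's _cac implementation:

```python
import numpy as np

def naive_forward_arc_counts(indices):
    """Count only arcs whose nearest neighbor lies in the future (j > i)."""
    counts = np.zeros(len(indices))
    for i, j in enumerate(indices):
        if int(j) > i:             # ignore arcs pointing into the past
            counts[i:int(j)] += 1  # forward arc covers positions i..j-1
    return counts

fwd = naive_forward_arc_counts(np.array([4, 3, 3, 1, 0]))
```

In a streaming setting this matters because the future side of the window is constantly being revised as new points arrive, so only forward-pointing arcs give a stable signal.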
Streaming Data with FLOSS¶
Instead of manually computing the CAC_1D like we did above, for streaming data we can call the floss function directly, which instantiates a streaming object. To demonstrate the use of floss, let's take some old_data and compute the matrix profile indices for it like we did above:
[11]:
old_data = df.iloc[20000:20000+5000, 1].values # This is well before the regime change has occurred
mp = stumpy.stump(old_data, m=m)
Now, we could do what we did earlier and compute the bidirectional corrected arc curve, but we'd like to see how the arc curve changes as a result of adding new data points. So, let's define some new data that is to be streamed in:
[12]:
new_data = df.iloc[25000:25000+5000, 1].values
Finally, we call the floss function to initialize a streaming object and pass in:

1. the matrix profile generated from the old_data (only the indices are used)
2. the old data used to generate the matrix profile in 1.
3. the matrix profile window size, m=210
4. the subsequence length, L=210
5. the exclusion factor
[13]:
stream = stumpy.floss(mp, old_data, m=m, L=L, excl_factor=1)
You can now update the stream with a new data point, t, via the stream.update(t) function. This will slide your window over by one data point and automatically update:

1. the CAC_1D (accessed via the .cac_1d_ attribute)
2. the matrix profile (accessed via the .P_ attribute)
3. the matrix profile indices (accessed via the .I_ attribute)
4. the sliding window of data used to produce the CAC_1D (accessed via the .T_ attribute - this should be the same size as the length of the old_data)
Let's continuously update our stream with the new_data one value at a time and store the results in a list (you'll see why in a second):
[14]:
windows = []
for i, t in enumerate(new_data):
    stream.update(t)
    if i % 100 == 0:
        windows.append((stream.T_, stream.cac_1d_))
Below, you can see an animation that changes as a result of updating the stream with new data. For reference, we've also plotted the CAC_1D (orange) that we manually generated above for the stationary data. You'll see that halfway through the animation, the regime change occurs and the updated CAC_1D (blue) becomes perfectly aligned with the orange curve.
[15]:
fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
axs[0].set_xlim((0, mp.shape[0]))
axs[0].set_ylim((0.1, max(np.max(old_data), np.max(new_data))))
axs[1].set_xlim((0, mp.shape[0]))
axs[1].set_ylim((0.1, 1.1))
lines = []
for ax in axs:
    line, = ax.plot([], [], lw=2)
    lines.append(line)
line, = axs[1].plot([], [], lw=2)
lines.append(line)

def init():
    for line in lines:
        line.set_data([], [])
    return lines

def animate(window):
    data_out, cac_out = window
    for line, data in zip(lines, [data_out, cac_out, cac_1d]):
        line.set_data(np.arange(data.shape[0]), data)
    return lines

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=windows, interval=100,
                               blit=True)
anim_out = anim.to_jshtml()
plt.close()  # Prevents duplicate image from displaying
if os.path.exists("None0000000.png"):
    os.remove("None0000000.png")  # Delete rogue temp file
HTML(anim_out)
# anim.save('/tmp/semantic.mp4')
[15]: