The Matrix Profile

Laying the Foundation

At its core, the STUMPY library efficiently computes something called a matrix profile, a vector that stores the z-normalized Euclidean distance between any subsequence within a time series and its nearest neigbor.

To fully understand what this means, let’s take a step back and start with a simple illustrative example along with a few basic definitions:

Time Series with Length n = 13

[1]:
time_series = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
n = len(time_series)

To analyze this time series with length n = 13, we could visualize the data or calculate global summary statistics (i.e., mean, median, mode, min, max). If you had a much longer time series, then you may even feel compelled to build an ARIMA model, perform anomaly detection, or attempt a forecasting model but these methods can be complicated and may often have false positives or no interpretable insights.

Time Series Visualization

However, if we were to apply Occam’s Razor, then what is the most simple and intuitive approach that we could take analyze to this time series?

To answer this question, let’s start with our first defintion:

Subsequence /ˈsəbsəkwəns/ noun

a part or section of the full time series

So, the following are all considered subsequences of our time_series since they can all be found in the time series above.

Subsequence 1 Subsequence 2 Subsequence 3

[2]:
print(time_series[0:2])
print(time_series[4:7])
print(time_series[2:10])
[0, 1]
[9, 1, 14]
[3, 2, 9, 1, 14, 15, 1, 2]

We can see that each subsequence can have a different sequence length that we’ll call m. So, for example, if we choose m = 4, then we can think about how we might compare any two subsequences of the same length.

[3]:
m = 4
i = 0  # starting index for the first subsequence
j = 8  # starting index for the second subsequence

subseq_1 = time_series[i:i+m]
subseq_2 = time_series[j:j+m]

print(subseq_1, subseq_2)
[0, 1, 3, 2] [1, 2, 2, 10]

Subsequences

One way to compare any two subsequences is to calculate what is called the Euclidean distance.

Euclidean Distance /yo͞oˈklidēən/ /ˈdistəns/ noun

the straight-line distance between two points

Euclidean Distance

[4]:
import math

D = 0
for k in range(m):
    D += (time_series[i+k] - time_series[j+k])**2
print(f"The square root of {D} = {math.sqrt(D)}")
The square root of 67 = 8.18535277187245

Distance Profile - Pairwise Euclidean Distances

Now, we can take this a step further where we keep one subsequence the same (reference subsequence), change the second subsequence in a sliding window manner, and compute the Euclidean distance for each window. The resulting vector of pairwise Euclidean distances is also known as a distance profile.

Pairwise Euclidean Distance

Of course, not all of these distances are useful. Specifically, the distance for the self match (or trivial match) isn’t informative since the distance will be always be zero when you are comparing a subsequence with itself. So, we’ll ignore it and, instead, take note of the next smallest distance from the distance profile and choose that as our best match:

Trivial Match

Next, we can shift our reference subsequence over one element at a time and repeat the same sliding window process to compute the distance profile for each new reference subsequence.

Distance Profiles

Distance Matrix

If we take all of the distance profiles that were computed for each reference subsequence and stack them one on top of each other then we get something called a distance matrix

Distance Matrix

The Real Problem - The Brute Force Approach

Now, it might seem pretty straightforward at this point but what we need to do is consider how to compute this full distance matrix efficiently. Let’s start with the brute force approach:

[5]:
for i in range(n-m+1):
    for j in range(n-m+1):
        D = 0
        for k in range(m):
            D += (time_series[i+k] - time_series[j+k])**2
        D = math.sqrt(D)

At first glance, this may not look too bad but if we start considering both the computational complexity as well as the spatial complexity then we begin to understand the real problem. It turns out that, for longer time series (i.e., n >> 10,000) the computational complexity is O(n2m) (as evidenced by the three for loops in the code above) and the spatial complexity for storing the full distance matrix is O(n2).

To put this into perspective, imagine if you had a single sensor that collected data 20 times/min over the course of 5 years. This would result:

[6]:
n = 20 * 60 * 24 * 364 * 5  # 20 times/min x 60 mins/hour x 24 hours/day x 365 days/year x 5 years
print(f"There would be n = {n} data points")
There would be n = 52416000 data points

Assuming that each calculation in the inner loop takes 0.0000001 seconds then this would take:

[7]:
time = 0.0000001 * (n * n - n)/2
print(f"It would take {time} seconds to compute")
It would take 137371850.1792 seconds to compute

Which is equivalent to 1,598.7 days (or 4.4 years) and 11.1 PB of memory to compute! So, it is clearly not feasible to compute the distance matrix using our naive brute force method. And this is where the matrix profile comes into play. Recall,

Matrix Profile /ˈmātriks/ /ˈprōˌfīl/ noun

a vector that stores the (z-normalized) Euclidean distance between any subsequence within a time series and its nearest neigbor

Practically, what this means is that the matrix profile is only interested in storing the smallest non-trivial distances from each distance profile, which significantly reduces the spatial complexity to O(n):

Matrix Profile

But we still haven’t figured out how to reduce the computational complexity and this is precisely the problem that is being addressed by STUMPY.

STUMPY

In the fall of 2016, researchers from the University of California, Riverside and the University of New Mexico published a beautiful set of back-to-back papers that described an exact method called STOMP for computing the matrix profile for any time series with a computational complexity of O(n2)! They also further demonstrated this using GPUs and this method approach is called GPU-STOMP.

With the academics, data scientists, and developers in mind, we have taken some of these concepts and have open sourced STUMPY, a powerful and scalable library that efficiently computes the matrix profile. And, thanks to other open source software such as Numba and Dask, our implementation can be both highly parallelized (for a single server) and highly distributed (across multiple servers). We’ve tested STUMPY on as many as 256 CPU cores that are spread across 32 servers and have achieved similiar performance to the published GPU-STOMP work.

Conclusion

According to the original authors, “these are the best ideas in times series data mining in the last two decades” and “given the matrix profile, most time series data mining problems are trivial to solve in a few lines of code”.

From our experience, this is definitely true and we are excited to share STUMPY with you! Please reach out and tell us know how STUMPY has enabled your time series analysis work as we’d love to hear from you!