# The Matrix Profile¶

## Laying the Foundation¶

At its core, the STUMPY library efficiently computes something called a matrix profile, a vector that stores the z-normalized Euclidean distance between any subsequence within a time series and its nearest neigbor.

To fully understand what this means, let’s take a step back and start with a simple illustrative example along with a few basic definitions:

## Time Series with Length n = 13¶

:

time_series = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]
n = len(time_series)


To analyze this time series with length n = 13, we could visualize the data or calculate global summary statistics (i.e., mean, median, mode, min, max). If you had a much longer time series, then you may even feel compelled to build an ARIMA model, perform anomaly detection, or attempt a forecasting model but these methods can be complicated and may often have false positives or no interpretable insights. However, if we were to apply Occam’s Razor, then what is the most simple and intuitive approach that we could take analyze to this time series?

## Subsequence /ˈsəbsəkwəns/ noun¶

### a part or section of the full time series¶

So, the following are all considered subsequences of our time_series since they can all be found in the time series above.   :

print(time_series[0:2])
print(time_series[4:7])
print(time_series[2:10])

[0, 1]
[9, 1, 14]
[3, 2, 9, 1, 14, 15, 1, 2]


We can see that each subsequence can have a different sequence length that we’ll call m. So, for example, if we choose m = 4, then we can think about how we might compare any two subsequences of the same length.

:

m = 4
i = 0  # starting index for the first subsequence
j = 8  # starting index for the second subsequence

subseq_1 = time_series[i:i+m]
subseq_2 = time_series[j:j+m]

print(subseq_1, subseq_2)

[0, 1, 3, 2] [1, 2, 2, 10] One way to compare any two subsequences is to calculate what is called the Euclidean distance.

## Euclidean Distance /yo͞oˈklidēən/ /ˈdistəns/ noun¶

### the straight-line distance between two points¶ :

import math

D = 0
for k in range(m):
D += (time_series[i+k] - time_series[j+k])**2
print(f"The square root of {D} = {math.sqrt(D)}")

The square root of 67 = 8.18535277187245


## Distance Profile - Pairwise Euclidean Distances¶

Now, we can take this a step further where we keep one subsequence the same (reference subsequence), change the second subsequence in a sliding window manner, and compute the Euclidean distance for each window. The resulting vector of pairwise Euclidean distances is also known as a distance profile. Of course, not all of these distances are useful. Specifically, the distance for the self match (or trivial match) isn’t informative since the distance will be always be zero when you are comparing a subsequence with itself. So, we’ll ignore it and, instead, take note of the next smallest distance from the distance profile and choose that as our best match: Next, we can shift our reference subsequence over one element at a time and repeat the same sliding window process to compute the distance profile for each new reference subsequence. ## Distance Matrix¶

If we take all of the distance profiles that were computed for each reference subsequence and stack them one on top of each other then we get something called a distance matrix ## The Real Problem - The Brute Force Approach¶

Now, it might seem pretty straightforward at this point but what we need to do is consider how to compute this full distance matrix efficiently. Let’s start with the brute force approach:

:

for i in range(n-m+1):
for j in range(n-m+1):
D = 0
for k in range(m):
D += (time_series[i+k] - time_series[j+k])**2
D = math.sqrt(D)


At first glance, this may not look too bad but if we start considering both the computational complexity as well as the spatial complexity then we begin to understand the real problem. It turns out that, for longer time series (i.e., n >> 10,000) the computational complexity is O(n2m) (as evidenced by the three for loops in the code above) and the spatial complexity for storing the full distance matrix is O(n2).

To put this into perspective, imagine if you had a single sensor that collected data 20 times/min over the course of 5 years. This would result:

:

n = 20 * 60 * 24 * 364 * 5  # 20 times/min x 60 mins/hour x 24 hours/day x 365 days/year x 5 years
print(f"There would be n = {n} data points")

There would be n = 52416000 data points


Assuming that each calculation in the inner loop takes 0.0000001 seconds then this would take:

:

time = 0.0000001 * (n * n - n)/2
print(f"It would take {time} seconds to compute")

It would take 137371850.1792 seconds to compute


Which is equivalent to 1,598.7 days (or 4.4 years) and 11.1 PB of memory to compute! So, it is clearly not feasible to compute the distance matrix using our naive brute force method. And this is where the matrix profile comes into play. Recall,

## Matrix Profile /ˈmātriks/ /ˈprōˌfīl/ noun¶

### a vector that stores the (z-normalized) Euclidean distance between any subsequence within a time series and its nearest neigbor¶

Practically, what this means is that the matrix profile is only interested in storing the smallest non-trivial distances from each distance profile, which significantly reduces the spatial complexity to O(n): But we still haven’t figured out how to reduce the computational complexity and this is precisely the problem that is being addressed by STUMPY.

## STUMPY¶

In the fall of 2016, researchers from the University of California, Riverside and the University of New Mexico published a beautiful set of back-to-back papers that described an exact method called STOMP for computing the matrix profile for any time series with a computational complexity of O(n2)! They also further demonstrated this using GPUs and this method approach is called GPU-STOMP.

With the academics, data scientists, and developers in mind, we have taken some of these concepts and have open sourced STUMPY, a powerful and scalable library that efficiently computes the matrix profile. And, thanks to other open source software such as Numba and Dask, our implementation can be both highly parallelized (for a single server) and highly distributed (across multiple servers). We’ve tested STUMPY on as many as 256 CPU cores that are spread across 32 servers and have achieved similiar performance to the published GPU-STOMP work.

## Conclusion¶

According to the original authors, “these are the best ideas in times series data mining in the last two decades” and “given the matrix profile, most time series data mining problems are trivial to solve in a few lines of code”.

From our experience, this is definitely true and we are excited to share STUMPY with you! Please reach out and tell us know how STUMPY has enabled your time series analysis work as we’d love to hear from you!