# Fast Pattern Searching¶

## Beyond Matrix Profiles¶

At the core of STUMPY, one can take any time series data and efficiently compute something called a matrix profile, which essentially scans along your entire times series with a fixed window size, `m`

, and finds the exact nearest neighbor for every subsequence within your time series. A matrix profile allows you to determine if there are any conserved behaviors (i.e., conserved subsequences/patterns) within your
data and, if so, it can tell you exactly where they are located within your time series. In a previous tutorial, we demonstrated how to use STUMPY to easily obtain a matrix profile, learned how to interpret the results, and discover meaningful motifs and discords. While this brute-force approach may be very useful when you don’t know what pattern or conserved behavior you are looking but, for sufficiently large datasets,
it can become quite expensive to perform this exhaustive pairwise search.

However, if you already have a specific user defined pattern in mind then you don’t actually need to compute the full matrix profile! For example, maybe you’ve identified an interesting trading strategy based on historical stock market data and you’d like to see if that specific pattern may have been observed in the past within one or more stock ticker symbols. In that case, searching for a known pattern or “query” is actually quite straightforward and can be accomplished quickly by using the
wonderful `core.mass`

function in STUMPY.

In this short tutorial, we’ll take a simple known pattern of interest (i.e., a query subsequence) and we’ll search for this pattern in a separate independent time series. Let’s get started!

## Getting Started¶

Let’s import the packages that we’ll need to load, analyze, and plot the data

```
[1]:
```

```
%matplotlib inline
import pandas as pd
import stumpy
import numpy as np
import numpy.testing as npt
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import urllib
import ssl
import io
import os
import time
plt.rcParams["figure.figsize"] = [20, 6] # width, height
plt.rcParams['xtick.direction'] = 'out'
```

## Loading the Sony AIBO Robot Dog Dataset¶

The time series data (below), `T_df`

, has `n = 13000`

data points and it was collected from an accelerometer inside of a Sony AIBO robot dog where it tracked the robot dog when it was walking from a cement surface onto a carpeted surface and, finally, back to the cement surface:

```
[2]:
```

```
context = ssl.SSLContext() # Ignore SSL certificate verification for simplicity
T_url = 'https://www.cs.unm.edu/~mueen/robot_dog.txt'
T_raw_bytes = urllib.request.urlopen(T_url, context=context).read()
T_data = io.BytesIO(T_raw_bytes)
T_df = pd.read_csv(T_data, header=None, sep='\s+', names=['Acceleration'])
T_df.head()
```

```
[2]:
```

Acceleration | |
---|---|

0 | 0.89969 |

1 | 0.89969 |

2 | 0.89969 |

3 | 0.89969 |

4 | 0.89969 |

## Visualizing the Sony AIBO Robot Dog Dataset¶

```
[3]:
```

```
plt.suptitle('Sony AIBO Robot Dog Dataset, T_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(T_df)
plt.text(2000, 4.5, 'Cement', color="black", fontsize=20)
plt.text(10000, 4.5, 'Cement', color="black", fontsize=20)
ax = plt.gca()
rect = Rectangle((5000, -4), 3000, 10, facecolor='lightgrey')
ax.add_patch(rect)
plt.text(6000, 4.5, 'Carpet', color="black", fontsize=20)
plt.show()
```

In the plot above, the periods of time when the robot dog was walking on cement is displayed with a white background while the times when the robot dog was walking on carpet is highlighted with a grey background. Do you notice any appreciable difference(s) between walking on the different surfaces? Are there any interesting insights that you can observe with the human eye? Do any conserved patterns exist within this time series and, if so, where are they?

## Have You Seen This Pattern?¶

The subsequence pattern or query (below) that we are interested in searching for in the time series (above) looks like this:

```
[4]:
```

```
Q_url = 'https://www.cs.unm.edu/~mueen/carpet_query.txt'
Q_raw_bytes = urllib.request.urlopen(Q_url, context=context).read()
Q_data = io.BytesIO(Q_raw_bytes)
Q_df = pd.read_csv(Q_data, header=None, sep='\s+', names=['Acceleration'])
plt.suptitle('Pattern or Query Subsequence, Q_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(Q_df, lw=2, color="C1") # Walking on cement
plt.show()
```

This pattern, `Q_df`

, has a window length of `m = 100`

and it was taken from a completely independent walking sample. Does it look familiar at all? Does a similar pattern exist in our earlier time series, `T_df`

? Can you tell which surface the robot dog was walking on when this query sample was collected?

To answer some of these questions, you can compare this specific query subsequence or pattern with the full time series by computing something called a “distance profile”. Essentially, you take this single query, `Q_df`

, and compare it to every single subsequence in `T_df`

by computing all possible (z-normalized Euclidean) pairwise distances. So, the distance profile is simply a 1-dimensional vector that tells you exactly how similar/dissimilar `Q_df`

is to every subsequence (of the same
length) found in `T_df`

. Now, a naive algorithm for computing the distance profile would take `O(n*m)`

time to process but, luckily, we can do much better than this as there exists a super efficient approach called “Mueen’s Algorithm for Similarity Search” (MASS) that is able to compute the distance profile in much faster `O(n*log(n))`

time (`log`

base 2). Now, this may not be a big deal if you only have a few short time
series to analyze but if you need to repeat this process many times with different query subsequences then things can add up quickly. In fact, as the length of the time series, `n`

, and/or the length of the query subsequence, `m`

, gets much longer, the naive algorithm would take way too much time!

## Computing the Distance Profile with MASS¶

So, given a query subsequence, `Q_df`

, and a time series, `T_df`

, we can perform a fast similarity search and compute a distance profile using the `core.mass`

function in STUMPY:

```
[5]:
```

```
distance_profile = stumpy.core.mass(Q_df["Acceleration"], T_df["Acceleration"])
```

And, since the `distance_profile`

contains the full list of pairwise distances between `Q_df`

and every subsequence within `T_df`

, we can retrieve the most similar subsequence from `T_df`

by finding the smallest distance value in `distance_profile`

and extracting its positional index:

```
[6]:
```

```
idx = np.argmin(distance_profile)
print(f"The nearest neighbor to `Q_df` is located at index {idx} in `T_df`")
```

```
The nearest neighbor to `Q_df` is located at index 7479 in `T_df`
```

So, to answer our earlier question of “Does a similar pattern exist in our earlier time series, T_df?”, let’s go ahead and plot the most similar subsequence in `T_df`

, which is located at index 7479 (blue), and overlay this with our query pattern, `Q_df`

, (orange):

```
[7]:
```

```
# Since MASS computes z-normalized Euclidean distances, we should z-normalize our subsequences before plotting
Q_z_norm = stumpy.core.z_norm(Q_df.values)
T_z_norm = stumpy.core.z_norm(T_df.values[idx:idx+len(Q_df)])
plt.suptitle('Comparing The Query To Its Nearest Neighbor', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(Q_z_norm, lw=2, color="C1", label="Query Subsequence, Q_df")
plt.plot(T_z_norm, lw=2, label="Nearest Neighbor Subsequence From T_df")
plt.legend()
plt.show()
```

Notice that even though the query subsequence does not perfectly match its nearest neighbor, STUMPY was still able to find it! And then, to answer the second question of “Can you tell which surface the robot dog was walking on when this query sample was collected?”, we can look at precisely where `idx`

is located within `T_df`

:

```
[8]:
```

```
plt.suptitle('Sony AIBO Robot Dog Dataset, T_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(T_df)
plt.text(2000, 4.5, 'Cement', color="black", fontsize=20)
plt.text(10000, 4.5, 'Cement', color="black", fontsize=20)
ax = plt.gca()
rect = Rectangle((5000, -4), 3000, 10, facecolor='lightgrey')
ax.add_patch(rect)
plt.text(6000, 4.5, 'Carpet', color="black", fontsize=20)
plt.plot(range(idx, idx+len(Q_df)), T_df.values[idx:idx+len(Q_df)], lw=2, label="Nearest Neighbor Subsequence")
plt.legend()
plt.show()
```

As we can see above, the nearest neighbor (orange) to `Q_df`

is a subsequence that is found when the robot dog was walking on carpet and, as it turns out, the `Q_df`

was collected from an independent sample where the robot dog was walking on carpet too! To take this a step further, instead of extracting the only top nearest neighbor, we can look at where the top `k = 16`

nearest neighbors are located:

```
[9]:
```

```
# This simply returns the (sorted) positional indices of the top 16 smallest distances found in the distance_profile
k = 16
idxs = np.argpartition(distance_profile, k)[:k]
idxs = idxs[np.argsort(distance_profile[idxs])]
```

And then let’s plot all of these subsequences based on their index locations:

```
[10]:
```

```
plt.suptitle('Sony AIBO Robot Dog Dataset, T_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(T_df)
plt.text(2000, 4.5, 'Cement', color="black", fontsize=20)
plt.text(10000, 4.5, 'Cement', color="black", fontsize=20)
ax = plt.gca()
rect = Rectangle((5000, -4), 3000, 10, facecolor='lightgrey')
ax.add_patch(rect)
plt.text(6000, 4.5, 'Carpet', color="black", fontsize=20)
for idx in idxs:
plt.plot(range(idx, idx+len(Q_df)), T_df.values[idx:idx+len(Q_df)], lw=2)
plt.show()
```

Unsurprisingly, the top `k = 16`

nearest neighbors to `Q_df`

(or best matches, shown in multiple colors above) can all be found when the robot dog was walking on the carpet (grey)!

## Summary¶

And that’s it! You have now taken a known pattern of interest (or query), ran it through `core.mass`

using STUMPY, and you were able to quickly search for this pattern in another time series. With this newfound knowledge, you can now go and search for patterns in your own time series projects. Happy coding!

## Additional Note - Distance Profiles with Non-normalized Euclidean Distances¶

There are times when you may want to use non-normalized Euclidean distance as your measure of similarity/dissimilarity, and so, instead of using `core.mass`

(which z-normalizes your subsequences first before computing the pairwise Euclidean distances), you can use the `core.mass_absolute`

function. This is provided for those who are interested in computing the non-normalized matrix profiles that are available in the complementary `stumpy.aamp`

, `stumpy.aamped`

, `stumpy.gpu_aamp`

, and
`stumpy.ammpi`

functions.

## Bonus Section - What Makes MASS So Fast?¶

The reason why MASS is so much faster than a naive approach is because MASS uses Fast Fourier Transforms (FFT) to convert the data into the frequency domain and performs what is called a “convolution”, which reduces the `m`

operations down to `log(n)`

operations. You can read more about this in the original Matrix Profile I paper.

Here’s a naive implementation of computing a distance profile:

```
[11]:
```

```
def compute_naive_distance_profile(Q, T):
Q = Q.copy()
T = T.copy()
n = len(T)
m = len(Q)
naive_distance_profile = np.empty(n - m + 1)
start = time.time()
Q = stumpy.core.z_norm(Q)
for i in range(n - m + 1):
naive_distance_profile[i] = np.linalg.norm(Q - stumpy.core.z_norm(T[i:i+m]))
naive_elapsed_time = time.time()-start
print(f"For n = {n} and m = {m}, the naive algorithm takes {np.round(naive_elapsed_time, 2)}s to compute the distance profile")
return naive_distance_profile
```

For a random time series, `T_random`

, with 1 million data points and a random query subsequence, `Q_random`

:

```
[12]:
```

```
Q_random = np.random.rand(100)
T_random = np.random.rand(1_000_000)
naive_distance_profile = compute_naive_distance_profile(Q_random, T_random)
```

```
For n = 1000000 and m = 100, the naive algorithm takes 40.89s to compute the distance profile
```

The naive algorithm takes over half a minute to compute! However, MASS can handle this (and even larger data sets) in about a 1 second:

```
[13]:
```

```
start = time.time()
mass_distance_profile = stumpy.core.mass(Q_random, T_random)
mass_elapsed_time = time.time()-start
print(f"For n = {len(T_random)} and m = {len(Q_random)}, the MASS algorithm takes {np.round(mass_elapsed_time, 2)}s to compute the distance profile")
```

```
For n = 1000000 and m = 100, the MASS algorithm takes 0.12s to compute the distance profile
```

And to be absolutely certain, let’s make sure and check that the output is the same from both methods:

```
[14]:
```

```
npt.assert_almost_equal(naive_distance_profile, mass_distance_profile)
```

Success, no errors! This means that both outputs are identical. Go ahead and give it a try!