# STUMPY Basics¶

## Analyzing Motifs and Anomalies with STUMP¶

This example utilizes the main takeaways from the research papers: Matrix Profile I & Matrix Profile II.

To explore the basic concepts, we’ll use the workhorse stump function to find interesting motifs (patterns) or discords (anomalies/novelties) and demonstrate these concepts with two different time series datasets:

1. The Steamgen dataset
2. The NYC taxi passengers dataset

stump is Numba JIT-compiled version of the popular STOMP algorithm that is described in detail in the original Matrix Profile II paper. stump is capable of parallel computation and it performs an ordered search for patterns and outliers within a specified time series and takes advantage of the locality of some calculations to minimize the runtime.

## Getting Started¶

Let’s import the packages that we’ll need to load, analyze, and plot the data.

[1]:

%matplotlib inline

import pandas as pd
import stumpy
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
from matplotlib.patches import Rectangle
import datetime as dt
import urllib
import ssl
import io
import os


The python functions below are going to be used throughout this example to automatically resize and create the plots that are displayed using the Matplotlib python plotting package.

[2]:

def change_plot_size(width, height, plt):
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = width
fig_size[1] = height
plt.rcParams["figure.figsize"] = fig_size
plt.rcParams['xtick.direction'] = 'out'

change_plot_size(20, 6, plt)


## What is a Motif?¶

Time series motifs are approximately repeated subsequences found within a longer time series. Being able to say that a subsequence is “approximately repeated” requires that you be able to compare subsequences to each other. In the case of STUMPY, all subsequences within a time series can be compared by computing the pairwise z-normalized Euclidean distances and then storing only the index to its nearest neighbor. This nearest neighbor distance vector is referred to as the matrix profile and the index to each nearest neighbor within the time series is referred to as the matrix profile index. Luckily, the stump function takes in any time series (with floating point values) and computes the matrix profile along with the matrix profile indices and, in turn, one can immediately find time series motifs. Let’s look at an example:

This data was generated using fuzzy models applied to mimic a steam generator at the Abbott Power Plant in Champaign, IL. The data feature that we are interested in is the output steam flow telemetry that has units of kg/s and the data is “sampled” every three seconds with a total of 9,600 datapoints.

[3]:

colnames = ['drum pressure',
'excess oxygen',
'water level',
'steam flow'
]

context = ssl.SSLContext()  # Ignore SSL certificate verification for simplicity
url = 'https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat'
data = io.BytesIO(raw_bytes)
steam_df.columns = colnames

[3]:

drum pressure excess oxygen water level steam flow
0 320.08239 2.506774 0.032701 9.302970
1 321.71099 2.545908 0.284799 9.662621
2 320.91331 2.360562 0.203652 10.990955
3 325.00252 0.027054 0.326187 12.430107
4 326.65276 0.285649 0.753776 13.681666

## Visualizing the Steamgen Dataset¶

[4]:

plt.suptitle('Steamgen Dataset', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Steam Flow', fontsize='20')
plt.plot(steam_df['steam flow'].values)

[4]:

[<matplotlib.lines.Line2D at 0x10c367320>]


Take a moment and carefully examine the plot above with your naked eye. If you were told that there was a pattern that was approximately repeated, can you spot it? Even for a computer, this can be very challenging. Here’s what you should be looking for:

## Manually Finding a Motif¶

[5]:

m = 640
fig, axs = plt.subplots(2)
plt.suptitle('Steamgen Dataset', fontsize='30')
axs[0].set_ylabel("Steam Flow", fontsize='20')
axs[0].plot(steam_df['steam flow'], alpha=0.5, linewidth=1)
axs[0].plot(steam_df['steam flow'].iloc[643:643+m])
axs[0].plot(steam_df['steam flow'].iloc[8724:8724+m])
rect = Rectangle((643, 0), m, 40, facecolor='lightgrey')
rect = Rectangle((8724, 0), m, 40, facecolor='lightgrey')
axs[1].set_xlabel("Time", fontsize='20')
axs[1].set_ylabel("Steam Flow", fontsize='20')
axs[1].plot(steam_df['steam flow'].values[643:643+m], color='C1')
axs[1].plot(steam_df['steam flow'].values[8724:8724+m], color='C2')

[5]:

[<matplotlib.lines.Line2D at 0x1297e38d0>]


The motif (pattern) that we are looking for is highlighted above and yet it is still very hard to be certain that the orange and green subsequences are a match (upper panel), that is, until we zoom in on them and overlay the subsequences on top each other (lower panel). Now, we can clearly see that the motif is very similar! The fundamental value of computing the matrix profile is that it not only allows you to quickly find motifs but it also identifies the nearest neighbor for all subsequences within your time series. Note that we haven’t actually done anything special here to locate the motif except that we grab the locations from the original paper and plotted them. Now, let’s take our steamgen data and apply the stump function to it:

## Find a Motif Using STUMP¶

[6]:

m = 640
mp = stumpy.stump(steam_df['steam flow'], m)


stump requires two parameters:

1. A time series
2. A window size, m

In this case, based on some domain expertise, we’ve chosen m = 640, which is roughly equivalent to half-hour windows. And, again, the output of stump is an array that contains all of the matrix profile values (i.e., z-normalized Euclidean distance to your nearest neighbor) and matrix profile indices in the first and second columns, respectively (we’ll ignore the third and fourth columns for now). Let’s plot the matrix profile next to our raw data:

[7]:

fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
plt.suptitle('Motif (Pattern) Discovery', fontsize='30')

axs[0].plot(steam_df['steam flow'].values)
axs[0].set_ylabel('Steam Flow', fontsize='20')
rect = Rectangle((643, 0), m, 40, facecolor='lightgrey')
rect = Rectangle((8724, 0), m, 40, facecolor='lightgrey')
axs[1].set_xlabel('Time', fontsize ='20')
axs[1].set_ylabel('Matrix Profile', fontsize='20')
axs[1].axvline(x=643, linestyle="dashed")
axs[1].axvline(x=8724, linestyle="dashed")
axs[1].plot(mp[:, 0])

[7]:

[<matplotlib.lines.Line2D at 0x12b5bcc50>]


What we learn is that the global minima (vertical dashed lines) from the matrix profile correspond to the locations of the two subsequences that make up the motif pair! And the exact z-normalized Euclidean distance between these two subsequences is:

[8]:

mp[:, 0].min()

[8]:

5.49161982776594


So, this distance isn’t zero since we saw that the two subsequences aren’t an identical match but, relative to the rest of the matrix profile (i.e., compared to either the mean or median matrix profile values), we can understand that this motif is a significantly good match.

## Find Anomalies using STUMP¶

Conversely, the maximum value in the matrix profile (computed from stump above) is:

[9]:

mp[:, 0].max()

[9]:

23.47616836730375


The matrix profile index also tells us which subsequence within the time series does not have nearest neighbor that resembles itself:

[10]:

np.argwhere(mp[:, 0] == mp[:, 0].max()).flatten()[0]

[10]:

3864


The subsequence located at this global maximum is also referred to as a discord, novelty, or anomaly:

[11]:

fig, axs = plt.subplots(2, sharex=True, gridspec_kw={'hspace': 0})
plt.suptitle('Discord (Anomaly/Novelty) Discovery', fontsize='30')

axs[0].plot(steam_df['steam flow'].values)
axs[0].set_ylabel('Steam Flow', fontsize='20')
rect = Rectangle((3864, 0), m, 40, facecolor='lightgrey')
axs[1].set_xlabel('Time', fontsize ='20')
axs[1].set_ylabel('Matrix Profile', fontsize='20')
axs[1].axvline(x=3864, linestyle="dashed")
axs[1].plot(mp[:, 0])

[11]:

[<matplotlib.lines.Line2D at 0x12b62efd0>]


Now that you’ve mastered the STUMPY basics and understand how to discover motifs and anomalies from a time series, we’ll leave it up to you to investigate other interesting local minima and local maxima in the steamgen dataset.

To further develop/reinforce our growing intuition, let’s move on and explore another dataset!

First, we’ll download historical data that represents the half-hourly average of the number of NYC taxi passengers over 75 days in the Fall of 2014.

We extract that data and insert it into a pandas dataframe, making sure the timestamps are stored as datetime objects and the values are of type float64. Note that we’ll do a little more data cleaning than above just so you can see an example where the timestamp is included. But be aware that stump does not actually use or need the timestamp column at all when computing the matrix profile.

[12]:

taxi_df = pd.read_csv("https://raw.githubusercontent.com/stanford-futuredata/ASAP/master/Taxi.csv", sep=',')
taxi_df['value'] = taxi_df['value'].astype(np.float64)
taxi_df['timestamp'] = pd.to_datetime(taxi_df['timestamp'])

[12]:

timestamp value
0 2014-10-01 00:00:00 12751.0
1 2014-10-01 00:30:00 8767.0
2 2014-10-01 01:00:00 7005.0
3 2014-10-01 01:30:00 5257.0
4 2014-10-01 02:00:00 4189.0

## Visualizing the Taxi Dataset¶

[13]:

# This code is going to be utilized to control the axis labeling of the plots
DAY_MULTIPLIER = 7  # Specify for the amount of days you want between each labeled x-axis tick

x_axis_labels = taxi_df[(taxi_df.timestamp.dt.hour==0)]['timestamp'].dt.strftime('%b %d').values[::DAY_MULTIPLIER]
x_axis_labels[1::2] = " "
x_axis_labels, DAY_MULTIPLIER

plt.suptitle('Taxi Passenger Raw Data', fontsize='30')
plt.xlabel('Window Start Date', fontsize ='20')
plt.ylabel('Half-Hourly Average\nNumber of Taxi Passengers', fontsize='20')
plt.plot(taxi_df['value'])

plt.xticks(np.arange(0, taxi_df['value'].shape[0], (48*DAY_MULTIPLIER)/2), x_axis_labels)
plt.xticks(rotation=75)
plt.minorticks_on()
plt.margins(x=0)
plt.show()


It seems as if there is a general periodicity between spans of 1-day and 7-days, which can likely be explained by the fact that more people use taxis throughout the day than through the night and that it is reasonable to say most weeks have similar taxi-rider patterns. Also, maybe there is an outlier just to the right of the window starting near the end of October but, other than that, there isn’t anything you can conclude from just looking at the raw data.

## Generating the Matrix Profile¶

Again, defining the window size, m, usually requires some level of domain knowledge but we’ll demonstrate later on that stump is robust to changes in this parameter. Since this data was taken half-hourly, we chose a value m = 48 to represent the span of exactly one day:

[14]:

m = 48
mp = stumpy.stump(taxi_df['value'], m=m)


## Visualizing the Matrix Profile¶

[15]:

plt.suptitle('1-Day STUMP', fontsize='30')
plt.xlabel('Window Start', fontsize ='20')
plt.ylabel('Matrix Profile', fontsize='20')
plt.plot(mp[:, 0])

plt.plot(575, 1.7, marker="v", markersize=15, color='b')
plt.text(620, 1.6, 'Columbus Day', color="black", fontsize=20)
plt.plot(1535, 3.7, marker="v", markersize=15, color='b')
plt.text(1580, 3.6, 'Daylight Savings', color="black", fontsize=20)
plt.plot(2700, 3.1, marker="v", markersize=15, color='b')
plt.text(2745, 3.0, 'Thanksgiving', color="black", fontsize=20)
plt.plot(30, .2, marker="^", markersize=15, color='b', fillstyle='none')
plt.plot(363, .2, marker="^", markersize=15, color='b', fillstyle='none')
plt.xticks(np.arange(0, 3553, (m*DAY_MULTIPLIER)/2), x_axis_labels)
plt.xticks(rotation=75)
plt.minorticks_on()
plt.show()


## Understanding the Matrix Profile¶

Let’s understand what we’re looking at.

### Lowest Values¶

The lowest values (open triangles) are considered a motif since they represent the pair of nearest neighbor subsequences with the smallest z-normalized Euclidean distance. Interestingly, the two lowest data points are exactly 7 days apart, which suggests that, in this dataset, there may be a periodicity of seven days in addition to the more obvious periodicity of one day.

### Highest Values¶

So what about the highest matrix profile values (filled triangles)? The subsequences that have the highest (local) values really emphasizes their uniqueness. We found that the top three peaks happened to correspond exactly with the timing of Columbus Day, Daylight Saving Time, and Thanksgiving, respectively.

## Different Window Sizes¶

As we had mentioned above, stump should be robust to the choice of the window size parameter, m. Below, we demonstrate how manipulating the window size can have little impact on your resulting matrix profile by running stump with varying windows sizes.

[16]:

days_dict ={
"Half-Day": 24,
"1-Day": 48,
"2-Days": 96,
"5-Days": 240,
"7-Days": 336,
}

days_df = pd.DataFrame.from_dict(days_dict, orient='index', columns=['m'])

[16]:

m
Half-Day 24
1-Day 48
2-Days 96
5-Days 240
7-Days 336

We purposely chose spans of time that correspond to reasonably intuitive day-lengths that could be chosen by a human.

[17]:

fig, axs = plt.subplots(5, sharex=True, gridspec_kw={'hspace': 0})
fig.text(0.5, -0.1, 'Subsequence Start Date', ha='center', fontsize='20')
fig.text(0.08, 0.5, 'Matrix Profile', va='center', rotation='vertical', fontsize='20')
for i, varying_m in enumerate(days_df['m'].values):
mp = stumpy.stump(taxi_df['value'], varying_m)
axs[i].plot(mp[:, 0])
axs[i].set_ylim(0,9.5)
axs[i].set_xlim(0,3600)
title = f"m = {varying_m}"
axs[i].set_title(title, fontsize=20, y=.5)
plt.xticks(np.arange(0, taxi_df.shape[0], (48*DAY_MULTIPLIER)/2), x_axis_labels)
plt.xticks(rotation=75)
plt.suptitle('STUMP with Varying Window Sizes', fontsize='30')
plt.show()


We can see that even with varying window sizes, our peaks stay prominent. But it looks as if all the non-peak values are converging towards each other. This is why having a knowledge of the data-context is important prior to running stump, as it is helpful to have a window size that may capture a repeating pattern or anomaly within the dataset.

## GPU-STUMP - Faster STUMP Using GPUs¶

When you have significantly more than a few thousand data points in your time series, you may need a speed boost to help analyze your data. Luckily, you can try gpu_stump, a super fast GPU-powered alternative to stump that gives speed of a few hundred CPUs and provides the same output as stump:

import stumpy

mp = stumpy.gpu_stump(df['value'], m=m)  # Note that you'll need a properly configured NVIDIA GPU for this


In fact, if you aren’t dealing with PII/SII data, then you can try out gpu_stump using the this notebook on Google Colab.

## STUMPED - Distributed STUMP¶

Alternatively, if you only have access to a cluster of CPUs and your data needs to stay behind your firewall, then stump and gpu_stump may not be sufficient for your needs. Instead, you can try stumped, a distributed and parallel implementation of stump that depends on Dask distributed:

import stumpy


And that’s it! You have now loaded in a dataset, ran it through stump using our package, and were able to extract multiple conclusions of existing patterns and anomalies within the two different time series. You can now import this package and use it in your own projects. Happy coding!