{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Matrix Profile\n",
"\n",
"## Laying the Foundation\n",
"\n",
"At its core, the STUMPY library efficiently computes something called a **matrix profile**, a vector that stores the [z-normalized Euclidean distance](https://youtu.be/LnQneYvg84M?t=374) between any subsequence within a time series and its nearest neighbor.\n",
"\n",
"To fully understand what this means, let's take a step back and start with a simple illustrative example along with a few basic definitions:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time Series with Length n = 13"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"time_series = [0, 1, 3, 2, 9, 1, 14, 15, 1, 2, 2, 10, 7]\n",
"n = len(time_series)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To analyze this time series with length `n = 13`, we could visualize the data or calculate global summary statistics (i.e., mean, median, mode, min, max). If you had a much longer time series, then you may even feel compelled to build an ARIMA model, perform anomaly detection, or attempt a forecasting model but these methods can be complicated and may often have false positives or no interpretable insights."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Time Series Visualization](images/time_series_viz.jpeg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, if we were to apply *Occam's Razor*, then what is the most **simple and intuitive** approach that we could take analyze to this time series?\n",
"\n",
"To answer this question, let's start with our first defintion:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subsequence /ˈsəbsəkwəns/ noun\n",
"\n",
"### *a part or section of the full time series*\n",
"\n",
"So, the following are all considered subsequences of our `time_series` since they can all be found in the time series above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Subsequence 1](images/subsequence_a.jpeg)\n",
"![Subsequence 2](images/subsequence_b.jpeg)\n",
"![Subsequence 3](images/subsequence_c.jpeg)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1]\n",
"[9, 1, 14]\n",
"[3, 2, 9, 1, 14, 15, 1, 2]\n"
]
}
],
"source": [
"print(time_series[0:2])\n",
"print(time_series[4:7])\n",
"print(time_series[2:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that each subsequence can have a different sequence length that we'll call `m`. So, for example, if we choose `m = 4`, then we can think about how we might compare any two subsequences of the same length."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1, 3, 2] [1, 2, 2, 10]\n"
]
}
],
"source": [
"m = 4\n",
"i = 0 # starting index for the first subsequence\n",
"j = 8 # starting index for the second subsequence\n",
"\n",
"subseq_1 = time_series[i:i+m]\n",
"subseq_2 = time_series[j:j+m]\n",
"\n",
"print(subseq_1, subseq_2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Subsequences](images/subsequences.jpeg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to compare any two subsequences is to calculate what is called the Euclidean distance.\n",
"\n",
"## Euclidean Distance /yo͞oˈklidēən/ /ˈdistəns/ noun\n",
"\n",
"### *the straight-line distance between two points*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Euclidean Distance](images/euclidean_distance.jpeg)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The square root of 67 = 8.18535277187245\n"
]
}
],
"source": [
"import math\n",
"\n",
"D = 0\n",
"for k in range(m):\n",
" D += (time_series[i+k] - time_series[j+k])**2\n",
"print(f\"The square root of {D} = {math.sqrt(D)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Distance Profile - Pairwise Euclidean Distances\n",
"\n",
"Now, we can take this a step further where we keep one subsequence the same (reference subsequence), change the second subsequence in a sliding window manner, and compute the Euclidean distance for each window. The resulting vector of pairwise Euclidean distances is also known as a **distance profile**.\n",
"\n",
"![Pairwise Euclidean Distance](images/pairwise_euclidean_distance.gif)\n",
"\n",
"Of course, not all of these distances are useful. Specifically, the distance for the self match (or trivial match) isn't informative since the distance will be always be zero when you are comparing a subsequence with itself. So, we'll ignore it and, instead, take note of the next smallest distance from the distance profile and choose that as our best match:\n",
"\n",
"![Trivial Match](images/trivial_match.jpeg)\n",
"\n",
"Next, we can shift our reference subsequence over one element at a time and repeat the same sliding window process to compute the distance profile for each new reference subsequence.\n",
"\n",
"![Distance Profiles](images/distance_matrix.gif)\n",
"\n",
"## Distance Matrix\n",
"\n",
"If we take all of the distance profiles that were computed for each reference subsequence and stack them one on top of each other then we get something called a **distance matrix**\n",
"\n",
"![Distance Matrix](images/distance_matrix.jpeg)\n",
"\n",
"Now, we can simplify this distance matrix by only looking at the nearest neighbor for each subsequence and this takes us to our next concept:\n",
"\n",
"## Matrix Profile /ˈmātriks/ /ˈprōˌfīl/ noun \n",
"\n",
"### a vector that stores the [(z-normalized) Euclidean distance](https://youtu.be/LnQneYvg84M?t=374) between any subsequence within a time series and its nearest neighbor\n",
"\n",
"Practically, what this means is that the matrix profile is only interested in storing the smallest non-trivial distances from each distance profile, which significantly reduces the spatial complexity to O(n):\n",
"\n",
"![Matrix Profile](images/matrix_profile.gif)\n",
"\n",
"We can now plot this matrix profile underneath our original time series. And, as it turns out, a reference subsequence with a small matrix profile value (i.e., it has a nearest neighbor significantly \"closeby\") may indicate a possible pattern while a reference subsequence with a large matrix profile value (i.e., its nearest neighbor is significantly \"faraway\") may suggest the presence of an anomaly. \n",
"\n",
"![Time Series Matrix Profile ](images/matrix_profile.jpeg)\n",
"\n",
"So, by simply computing and inspecting the matrix profile alone, one can easily pick out the top pattern (global minimum) and rarest anomaly (global maximum). And this is only a small glimpse into what is possible once you've computed the matrix profile!\n",
"\n",
"## The Real Problem - The Brute Force Approach\n",
"\n",
"Now, it might seem pretty straightforward at this point but what we need to do is consider how to compute the full distance matrix efficiently. Let's start with the brute force approach: "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"for i in range(n-m+1):\n",
" for j in range(n-m+1):\n",
" D = 0\n",
" for k in range(m):\n",
" D += (time_series[i+k] - time_series[j+k])**2\n",
" D = math.sqrt(D)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At first glance, this may not look too bad but if we start considering both the computational complexity as well as the spatial complexity then we begin to understand the real problem. It turns out that, for longer time series (i.e., *n >> 10,000*) the computational complexity is *O(n*^{2}m) (as evidenced by the three for loops in the code above) and the spatial complexity for storing the full distance matrix is *O(n*^{2}). \n",
"\n",
"To put this into perspective, imagine if you had a single sensor that collected data 20 times/min over the course of 5 years. This would result:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There would be n = 52416000 data points\n"
]
}
],
"source": [
"n = 20 * 60 * 24 * 364 * 5 # 20 times/min x 60 mins/hour x 24 hours/day x 365 days/year x 5 years\n",
"print(f\"There would be n = {n} data points\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming that each calculation in the inner loop takes 0.0000001 seconds then this would take:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"It would take 137371850.1792 seconds to compute\n"
]
}
],
"source": [
"time = 0.0000001 * (n * n - n)/2\n",
"print(f\"It would take {time} seconds to compute\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Which is equivalent to 1,598.7 days (or 4.4 years) and 11.1 PB of memory to compute!** So, it is clearly not feasible to compute the distance matrix using our naive brute force method. Instead, we need to figure out how to reduce this computational complexity by efficiently generating a matrix profile and this is where *STUMPY* comes into play. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## STUMPY\n",
"\n",
"In the fall of 2016, researchers from the [University of California, Riverside](https://www.cs.ucr.edu/~eamonn) and the [University of New Mexico](https://www.cs.unm.edu/~mueen/) published a beautiful set of [back-to-back papers](https://www.cs.ucr.edu/~eamonn/MatrixProfile.html) that described an __exact method__ called **STOMP** for computing the matrix profile for any time series with a computational complexity of O(n^{2})! They also further demonstrated this using GPUs and they called this faster approach **GPU-STOMP**.\n",
"\n",
"With the academics, data scientists, and developers in mind, we have taken these concepts and have open sourced STUMPY, a powerful and scalable library that efficiently computes the matrix profile according to this published research. And, thanks to other open source software such as [Numba](http://numba.pydata.org/) and [Dask](https://dask.org/), our implementation is highly parallelized (for a single server with multiple CPUs or, alternatively, multiple GPUs), highly distributed (with multiple CPUs across multiple servers). We've tested STUMPY on as many as 256 CPU cores (spread across 32 servers) or 16 NVIDIA GPU devices (on the same DGX-2 server) and have achieved similar [performance](https://github.com/TDAmeritrade/stumpy#performance) to the published GPU-STOMP work."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"According to the original authors, \"these are the best ideas in times series data mining in the last two decades\" and \"given the matrix profile, [most time series data mining problems are trivial to solve in a few lines of code](https://www.cs.ucr.edu/~eamonn/100_Time_Series_Data_Mining_Questions__with_Answers.pdf)\". \n",
"\n",
"From our experience, this is definitely true and we are excited to share STUMPY with you! Please reach out and let us know how STUMPY has enabled your time series analysis work as we'd love to hear from you!\n",
"\n",
"## Additional Notes\n",
"\n",
"For the sake of completeness, we'll provide a few more comments for those of you who'd like to compare your own matrix profile implementation to STUMPY. However, due to the many details that are omitted in the original papers, we strongly encourage you to use [STUMPY](https://stumpy.readthedocs.io/en/latest/).\n",
"\n",
"In our explanation above, we've only excluded the trivial match from consideration. However, this is insufficient since nearby subsequences (i.e., `i ± 1`) are likely highly similar and we need to expand this to a larger \"exclusion zone\" relative to the diagonal trivial match. Here, we can visualize what different exclusion zones look like: \n",
"\n",
"![Exclusion Zone](images/trivial.jpeg)\n",
"\n",
"However, in practice, it has been found that an exclusion zone of `i ± int(np.ceil(m / 4))` works well (where `m` is the subsequence window size) and the distances computed in this region are is set to `np.inf` before the matrix profile value is extracted for the `ith` subsequence. Thus, the larger the window size is, the larger the exclusion zone will be. Additionally, note that, since NumPy indexing has an inclusive start index but an exlusive stop index, the proper way to ensure a symmetrical exclusion zone is:\n",
"\n",
"```\n",
"excl_zone = int(np.ceil(m / 4))\n",
"zone_start = i - excl_zone\n",
"zone_end = i + excl_zone + 1 # Notice that we add one since this is exclusive\n",
"distance_profile[zone_start : zone_end] = np.inf\n",
"```\n",
"\n",
"Finally, it is very uncommon for users to need to change the default exclusion zone. However, in exceptionally rare cases, the exclusion zone can be changed globally in STUMPY through the `config.STUMPY_EXCL_ZONE_DENOM` parameter where all exclusion zones are computed as `i ± int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))`:\n",
"\n",
"```\n",
"import stumpy\n",
"from stumpy import config\n",
"\n",
"config.STUMPY_EXCL_ZONE_DENOM = 4 # The exclusion zone is i ± int(np.ceil(m / 4)) and is the same as the default setting\n",
"mp = stumpy.stump(T, m)\n",
"\n",
"config.STUMPY_EXCL_ZONE_DENOM = 1 # The exclusion zone is i ± m\n",
"mp = stumpy.stump(T, m)\n",
"\n",
"config.STUMPY_EXCL_ZONE_DENOM = np.inf # The exclusion zone is i ± 0 and is the same as applying no exclusion zone\n",
"mp = stumpy.stump(T, m)\n",
"```\n",
"\n",
"## Resources\n",
"\n",
"\n",
"[STUMPY Documentation](https://stumpy.readthedocs.io/en/latest/)\n",
"\n",
"[STUMPY Matrix Profile Github Code Repository](https://github.com/TDAmeritrade/stumpy)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}