Next Generation Insights to Handle Data with Seasonality
As an IT administrator in a large enterprise, you want to make sure that the entire IT infrastructure is not only up all the time but also functioning efficiently. If a critical service goes down suddenly or several end users start experiencing issues, you want to know about it and fix it as soon as possible. To address these needs, we introduced Insights in Workspace ONE Intelligence. Since its introduction in early 2022, customer interest in Insights has steadily grown, and it is currently one of the most visited pages in the Intelligence console. Recently, we announced the next generation of Insights, which is powered by machine learning models. In this blog post, we will go under the hood and look at how Insights are generated. To set the context, we will start with the first generation and then go over the recent enhancements.
Insights in Workspace ONE Intelligence
Insights can be accessed via the Experience Management menu in the Workspace ONE Intelligence console.
Insights are shown using cards. Each card corresponds to one Insight and displays a brief description of the event associated with the Insight, the time the event occurred, the number of devices or users affected by the event, and other metadata such as the device platform type. Users can sort Insights by the time they were created or when the last comment was made, and filter them using different criteria, such as Insights related to specific apps, devices, users, and so on. For more information on accessing and working with Insights in the Intelligence console, please refer to the product documentation at Experience Management Solution.
So how are Insights generated? Let’s start with a simple example. Suppose we want to make sure that end-user devices are continuously up and functioning properly. If several devices are experiencing OS crashes, that would be a problem. So, it makes sense to monitor the number of OS crashes across the organization periodically, say every hour. This gives us a time series where the time step is one hour and the number of OS crashes is the value of the series. At every time step, the observed value can be compared to some threshold to determine if it is an outlier – is it out of expected bounds? If it is an outlier, an Insight can be created. For example, if the number of OS crashes exceeds the threshold, the Insight would say that an unexpected increase in OS crashes was encountered. But several questions remain. How does one know what events or metrics to monitor? And how does one set the threshold to compare against?
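Conceptually, this basic check is just a comparison of each hourly value against a fixed bound. Here is a minimal sketch in Python; the crash counts and the threshold of ten crashes per hour are hypothetical and only illustrate the idea, not how the Insights engine actually sets thresholds (more on that below).

```python
# Hypothetical hourly OS crash counts and a fixed, manually chosen threshold.
hourly_os_crashes = [3, 5, 4, 6, 2, 18, 4]  # one value per hour
THRESHOLD = 10  # assumed upper bound on "normal" crashes per hour

for hour, crashes in enumerate(hourly_os_crashes):
    if crashes > THRESHOLD:
        # In the product, crossing the bound would surface an Insight card.
        print(f"Hour {hour}: {crashes} OS crashes exceeds {THRESHOLD} -> create an Insight")
```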
The discussion so far used OS crashes as an example. But an OS crash is just one of many events that you might want to monitor. In practice, it is unlikely that you even know what events or metrics need to be monitored. This is where the power of Workspace ONE Intelligence becomes clear. It automatically monitors thousands of metrics for each organization, including OS crashes, boot duration, shutdown duration, CPU utilization, application crashes and hangs per application, application usage, failed logins, failed application installations, network errors, and many more. We are constantly adding new metrics that are monitored by our Insights engine.
Basic Outlier Detection using Automatic Baselining
Now let’s get back to the problem of setting thresholds. One approach is to use prior knowledge about the quantity being measured. For example, if you know that ten OS crashes per hour is too many, you can set the threshold to ten. However, such prior knowledge is rarely available. Besides, this approach is impractical when you have thousands of metrics to monitor. Workspace ONE Intelligence solves this problem by automatically and dynamically setting thresholds based on baselines derived from recent measurements.
Like before, let’s consider an example. Suppose we are interested in monitoring the amount of time it takes to boot devices running Microsoft Windows. The number of devices that boot up will vary from hour to hour, but we can compute the average or median boot time across all devices that boot up during each hour.
Figure 1: Automatic baselining using recent history.
The chart in Figure 1 shows a hypothetical example of how the observation value (in this case, the average boot time in seconds) varies hour by hour over a two-week period. The solid blue line is the observation value, and the lightly shaded area is the expected range. The expected range at any time step is computed by looking back at the recent history; for example, the history period can be one week (168 hours). The distribution of observation values in the recent history is used to compute lower and upper bounds for the expected range. If the actual observed value at any time step falls outside the expected range, it is marked as an outlier, and an Insight is created. Referring to Figure 1, for the most part the observation value stays in a tight band around 60 seconds. But around time step 200, there is a sudden jump in the value to 85 seconds, which is automatically detected as an outlier since it falls outside the expected range.
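To make the baselining idea concrete, here is a minimal sketch that derives the expected range at each time step from the trailing one-week window. Using the window mean plus or minus a few standard deviations for the bounds, and all of the numbers, are illustrative assumptions for this sketch rather than the product’s actual implementation.

```python
import numpy as np

HISTORY_HOURS = 168  # look-back window: one week of hourly observations

def expected_range(history, k=4.0):
    # Derive lower/upper bounds from the distribution of recent observations.
    # Mean +/- k standard deviations is an illustrative choice of bounds.
    mean, std = np.mean(history), np.std(history)
    return mean - k * std, mean + k * std

def detect_outliers(series):
    # Yield (time_step, value, lower, upper) for values outside the expected range.
    for t in range(HISTORY_HOURS, len(series)):
        lower, upper = expected_range(series[t - HISTORY_HOURS:t])
        if not lower <= series[t] <= upper:
            yield t, series[t], lower, upper

# Hypothetical data: boot times around 60 seconds with a sudden jump at time step 200.
rng = np.random.default_rng(0)
boot_times = rng.normal(60, 2, 400)
boot_times[200] = 85
for t, value, lower, upper in detect_outliers(boot_times):
    print(f"t={t}: {value:.1f}s is outside the expected range [{lower:.1f}s, {upper:.1f}s]")
```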
Handling Data with Seasonality
The sample data shown in Figure 1 is typical for certain metrics like boot duration and shutdown duration – measurements that are generally stable and don’t depend on the time of day, the day of the week, or other externalities. However, many other metrics do depend on factors like the time of day and the day of the week. For example, consider the number of OS crashes observed every hour. It makes intuitive sense to see more crashes when more devices are in use and fewer crashes when fewer devices are in use. The number of devices that are actively in use varies depending on the time of day and the day of the week. In most cases, more devices are actively used during weekdays than on weekends, and certain hours during the day (e.g., mid-morning or early afternoon) are more active than others (e.g., nights). In other words, device usage has daily and weekly cycles. Such cyclic behavior is referred to as seasonality. How does the basic outlier detection method described in the previous section perform on data with seasonality? Let’s look at the example shown in Figure 2.
Figure 2: Basic outlier detection applied to data with seasonality.
The observation value shown in Figure 2 (e.g., the number of OS crashes) covers a span of two weeks and has daily and weekly cycles. The solid blue line corresponds to the measured value, and the lightly shaded area is the expected range computed from the distribution of data in the most recent history (e.g., the past week). The taller peaks correspond to weekdays and the shorter peaks to weekends. Now, consider what happens when a sudden spike is seen during the weekend, like the one shown at time step 142 in Figure 2. Since the expected range is computed by looking at data from the past week, which is dominated by weekdays, the spike seen during the weekend is categorized as normal and the outlier is missed. Only extreme spikes like the one at time step 174 are categorized as outliers. So, the basic outlier detection method does not work well when the observed data has seasonality. The next generation of the Insights engine in Workspace ONE Intelligence addresses this issue by using advanced outlier detection methods based on machine learning.
Advanced Outlier Detection using Machine Learning
As noted earlier, the first generation of the Insights engine works well when the observed quantity does not have any seasonality. The next generation of the Insights engine explicitly handles seasonality in the data by using powerful time-series forecasting models. The forecasting model is used to predict the value at the next time step given the recent history. The observed value is then compared with the predicted value to determine the error; if the error is “large,” the observation is marked as an outlier. But what does “large” mean? This is determined by computing an anomaly score based on the error. The anomaly score is a number between 0 and 1 that indicates the extent to which a given observation is an outlier. It is simply the complement of the probability of encountering an error greater than or equal to the observed error (i.e., anomaly score = 1 – error probability). The validation data used while training the forecasting models is used to determine the distribution of the prediction error; this error distribution is in turn used to compute the probability of encountering an error greater than or equal to the observed error. Finally, the anomaly score is compared to a threshold to determine if an observed value is an outlier.
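To illustrate the idea, here is a simplified sketch. It uses a seasonal-naive forecast (predicting each hour from the same hour one week earlier) as a stand-in for the engine’s forecasting models, and an empirical error distribution from earlier time steps in place of the validation-data error distribution; the synthetic data, seasonality period, and score threshold are all assumptions made for the example.

```python
import numpy as np

SEASON = 168  # assumed weekly seasonality for hourly data

def seasonal_naive_forecast(history):
    # Predict the next value as the value one season (one week) earlier.
    # Stand-in for the more powerful forecasting models used by the engine.
    return history[-SEASON]

def anomaly_score(error, validation_errors):
    # Anomaly score = 1 - P(validation error >= observed error).
    return 1.0 - np.mean(np.asarray(validation_errors) >= error)

# Synthetic hourly series with a weekly pattern, plus a spike injected at the end.
rng = np.random.default_rng(0)
hours = np.arange(3 * SEASON + 1)
series = 50 + 20 * np.sin(2 * np.pi * hours / SEASON) + rng.normal(0, 2, hours.size)
series[-1] += 30  # the sudden spike we want to catch

# Empirical error distribution from earlier time steps (stand-in for validation errors).
validation_errors = [abs(series[t] - seasonal_naive_forecast(series[:t]))
                     for t in range(SEASON, len(series) - 1)]

# Score the latest observation.
error = abs(series[-1] - seasonal_naive_forecast(series[:-1]))
score = anomaly_score(error, validation_errors)
print(f"anomaly score = {score:.3f}")
if score > 0.99:  # illustrative threshold
    print("Outlier detected -> create an Insight")
```

Because the forecast itself follows the daily and weekly cycles, a weekend spike produces a large prediction error and hence a high anomaly score, even though its absolute value is well below the weekday peaks.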
Figure 3: Advanced outlier detection applied to data with seasonality.
Figure 3 illustrates what happens when the advanced outlier detection algorithm is applied to time-series data with seasonality. The solid blue line is the observed value and is the same as that in Figure 2. Two things stand out. First, the expected range, shown by the lightly shaded area, follows the same pattern as the observed value. Second, the sudden spike during the weekend at time step 142 is categorized as an outlier, in addition to the outlier detected at time step 174. In contrast, the expected range shown in Figure 2 does not follow the patterns in the observed value, and as a result the sudden spike during the weekend goes undetected.
Sneak Peek into the Future
While the second generation of the Insights engine is a big improvement over the first, there is more to come. We will be introducing several more enhancements in the future:
- Users will have more control over how and when Insights are generated. For example, users will be able to set manual thresholds if they have prior information and it suits their needs, and adjust sensitivity thresholds to control the number of Insights generated.
- New automatically monitored metrics will be added to extend the set of metrics that are already tracked. In addition, users will be able to define their own custom metrics to monitor.
- The current outlier detection algorithm looks for sudden changes in a measured value compared to its recent short-term history. In the future, we will include new types of Insights based on long-term trends as well. We also plan to introduce Insights that detect population-based anomalies rather than time-series anomalies.