Can Root Cause Analysis (RCA) Be Done Faster?
IT administrators and support personnel know all too well how hard it is to troubleshoot IT issues. Doing so means manually sifting through vast amounts of data and performing root cause analysis (RCA) to identify the cause. In today’s fast-paced and competitive business environment, IT teams can no longer take a laissez-faire attitude toward fixing IT issues, because downtime has significant cost implications. The loss in productivity and sales is obvious, but a poor user experience can also lower employee morale, slow innovation, and hurt employee retention.
To streamline the RCA process, fine-grained telemetry data are collected from end-user devices as well as networking equipment. The data are stored in large data lakes and are used for RCA when problems occur. In some cases, advanced Artificial Intelligence (AI) and Machine Learning (ML) techniques are used to predict future problems. In theory, all this is a good idea and should help the troubleshooting process. In practice, however, the massive amounts of data collected give rise to a new set of problems: how does one quickly sift through the data and isolate the root cause of the problem? To make matters worse, the telemetry data collected are often noisy and incomplete. It is like finding a needle in a haystack, except that in this case the haystack is enormous, and all kinds of other things are mixed in with the hay.
So, what do IT teams do today for RCA? Unfortunately, there is no standardized way to analyze telemetry data to troubleshoot IT problems. Typically, IT teams build “playbooks” over time from their collective experience, adding a new entry whenever a new problem is encountered and solved. Playbooks are the first line of defense for RCA, but they do not extract the full value of all the telemetry data collected. Can the RCA process be automated? Can we find the needle in the haystack, and find it fast? Yes, we can!
Automating RCA
Let’s consider an everyday example that IT teams are all too familiar with: applying software patches to Windows devices. Microsoft regularly releases patches for Windows, typically on the second Tuesday of every month. IT teams test these patches before deploying them to all the devices used by employees, and they deploy in waves or rings to discover issues as early as possible. Despite these efforts, new incompatibilities are still discovered after deployment. The symptoms can be OS crashes, inability to run some applications, unresponsive or slow devices, and so on. If there is a sudden increase in the number of OS crashes, can we automatically detect that the increase was due to a recently rolled-out patch? One way to do this is to count the devices on which the patch was applied and an OS crash occurred. Let’s say 100 devices crashed and 60 of them had the patch applied. Of the devices that crashed, 60% had the patch; this percentage is called the support for that specific patch. The specific patch is referred to as a feature, but the term “feature” can refer to other attributes and entities as well. More on this later.
Finding features with high support values is a good start for RCA but is not good enough. Here’s why. What if 100 other devices were running normally (i.e., did not encounter any OS crashes) and 60 of them had the specific patch applied? The base rate for this patch is also 60%. In this hypothetical example, the support and the base rate are the same, so the high support value does not provide any useful information for RCA: the patch is just as common on healthy devices as on crashed ones. Now let’s say that only 30 of the 100 devices running normally had the patch applied. In this case, the base rate is 30%, and the ratio of the support to the base rate is 2. In other words, the patch is twice as likely to be present on a device that crashed as on a device running normally. The ratio of the support to the base rate is called the lift of a feature. A high support value means that the feature is present on many of the affected devices, whereas a high lift value means that the feature is strongly over-represented among them, which raises our confidence that it is related to the cause.
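The arithmetic in this example can be written out in a few lines of Python. This is only a minimal sketch of the definitions given above; the function name support_and_lift and its parameters are made up for illustration and are not part of any Workspace ONE API.

```python
def support_and_lift(crashed_with, crashed_total, normal_with, normal_total):
    """Return (support, base_rate, lift) for a single feature.

    support   = share of crashed devices that have the feature
    base_rate = share of normally running devices that have the feature
    lift      = support / base_rate
    """
    support = crashed_with / crashed_total
    base_rate = normal_with / normal_total
    # If the feature never appears on healthy devices, the lift is unbounded.
    lift = support / base_rate if base_rate > 0 else float("inf")
    return support, base_rate, lift


# First scenario: 60 of 100 crashed devices and 60 of 100 normal devices have the patch.
_, _, lift_same = support_and_lift(60, 100, 60, 100)

# Second scenario: 60 of 100 crashed devices, but only 30 of 100 normal devices.
support, base_rate, lift = support_and_lift(60, 100, 30, 100)

print(f"lift when base rate equals support: {lift_same:.1f}")
print(f"support={support:.0%}, base rate={base_rate:.0%}, lift={lift:.1f}")
```

A lift close to 1 says the feature is no more common among crashed devices than among healthy ones, which is why support alone is not enough.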
The example considered above used a software patch as a feature. As noted earlier, the term “feature” is broader and can include other attributes. For example, features could include static attributes of the device such as make, model, OS version, physical memory size, and so on. They could also include dynamic elements such as a list of all the patches that were applied recently, a list of applications that were active before the crash occurred, a list of applications that were installed recently, and so on.
Guided RCA in Workspace ONE Intelligence
The telemetry data collected from devices can be used to compute support and lift values for each of the available features. Features with both high support and high lift values can guide the RCA process by narrowing the search space for finding the root cause. More importantly, support and lift values can be computed for combinations of features, where a combination is a set of one or more available features. For example, a combination can be a specific patch (e.g., KB5035967) and a specific device make/model (e.g., Dell Inspiron 3511). As the number of features increases, the number of combinations grows exponentially, making this computationally expensive. We have implemented this in Workspace ONE Intelligence using advanced pattern mining and machine learning techniques to efficiently process large volumes of telemetry data and find all combinations of features that have high support and lift values.
Figure 1: Results from Guided RCA tool in Experience Management within Workspace ONE Intelligence
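To make the idea of scoring feature combinations concrete, here is a small brute-force sketch in Python. It is not the pattern mining algorithm used in Workspace ONE Intelligence, which is not described in this post; it simply enumerates every combination up to a maximum length and keeps those above a support and lift threshold. The function name, the thresholds, the "patch:" and "model:" labels, and the toy device data are all assumptions made for illustration.

```python
from itertools import combinations


def mine_patterns(crashed, normal, min_support=0.5, min_lift=1.5, max_len=2):
    """Brute-force search for feature combinations with high support and lift.

    crashed, normal: non-empty lists of sets, one set of feature labels per device.
    Returns a list of (combination, support, lift), highest lift first.
    """
    if not crashed or not normal:
        raise ValueError("both device groups must be non-empty")
    features = sorted(set().union(*crashed))
    results = []
    for k in range(1, max_len + 1):
        for combo in combinations(features, k):
            items = frozenset(combo)
            # Support: share of crashed devices that have every feature in the combination.
            support = sum(items <= device for device in crashed) / len(crashed)
            if support < min_support:
                continue
            # Base rate: share of normally running devices with the same combination.
            base_rate = sum(items <= device for device in normal) / len(normal)
            lift = support / base_rate if base_rate > 0 else float("inf")
            if lift >= min_lift:
                results.append((combo, support, lift))
    return sorted(results, key=lambda r: (-r[2], -r[1]))


# Toy data: four crashed devices and four devices running normally.
crashed = [
    {"patch:KB5035967", "model:Dell Inspiron 3511"},
    {"patch:KB5035967", "model:Dell Inspiron 3511"},
    {"patch:KB5035967", "model:Dell Inspiron 3511"},
    {"patch:KB5035967", "model:Other"},
]
normal = [
    {"patch:KB5035967", "model:Other"},
    {"model:Dell Inspiron 3511"},
    {"model:Other"},
    {"model:Other"},
]

results = mine_patterns(crashed, normal)
for combo, support, lift in results:
    print(combo, f"support={support:.2f}", f"lift={lift:.2f}")
```

The nested loops visit every combination, which is exactly the exponential growth mentioned above. Practical pattern miners avoid much of that work by pruning: a combination can never have higher support than any of its subsets, so once a subset falls below the support threshold, none of its supersets need to be checked.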
Guided RCA can be accessed via the Workspace ONE Intelligence console in multiple ways. A natural place to start is from the Investigations page.
Figure 2: Investigations are located under Workspace tab > Experience Management in Workspace ONE Intelligence
Note that Investigations will only appear in the Intelligence admin console under the Workspace tab when Experience Management is enabled in the tenant. Please see this Getting Started Guide to learn more.
Users can click the RCA tab within an Investigation to access the Guided RCA tool. Contextual information from the incident is automatically transferred over. For example, if the incident was related to OS crashes, then the RCA type is preset to OS Crashes. However, the admin can switch to any other supported metric; currently, these are OS crashes, app crashes, and slow boot time.
Figure 3: Configuration before running the Guided RCA tool
Users can also change various advanced parameters associated with the RCA algorithm such as date range, support threshold, lift threshold, and the maximum length of patterns to search (i.e., maximum number of features in each combination). Furthermore, users can select the set of features to use in the RCA algorithm. For example, users can select only device attributes or only software patches if they want to restrict the set of features.
Figure 4: Setting advanced parameters for the Guided RCA tool in Workspace ONE Intelligence
The RCA process is started by clicking the RUN ANALYSIS button in the UI. Since it is a computationally intensive process, the tool can take several minutes to complete. Users are given an estimate of the run time before the job is submitted. In some cases, the RCA process is run automatically when an Insight is created. (Insights are another machine learning-based feature that surfaces anomalies in your deployment.) In this case, the results from the automatic run are prepopulated in the RCA page and users don’t need to wait.
For easy navigation, the results are grouped based on the similarity of the feature patterns and are shown in a table. Each row shows an “anchor” item in the group and the confidence value associated with it. The anchor item is the most important feature that is present in all patterns belonging to a group. All patterns that are part of the same group are shown in the Related Results section below each of the anchor items.
Figure 5: List of anchor items in Guided RCA results and related results
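One simple way to picture this grouping is sketched below in Python. The post does not say how the product picks an anchor or measures similarity, so the rule used here is purely an assumption: greedily choose the feature shared by the most remaining patterns, make it the anchor, and put every pattern containing it into that anchor's group. The function name and the sample patterns are hypothetical.

```python
from collections import Counter


def group_by_anchor(patterns):
    """Group feature patterns under a shared "anchor" feature.

    Assumed rule (not the product's documented method): repeatedly pick the
    feature that occurs in the most ungrouped patterns, with ties broken
    alphabetically, and group every pattern that contains it.
    """
    remaining = [frozenset(p) for p in patterns if p]  # ignore empty patterns
    groups = {}
    while remaining:
        counts = Counter(feature for p in remaining for feature in p)
        anchor = min(counts, key=lambda f: (-counts[f], f))
        groups[anchor] = [sorted(p) for p in remaining if anchor in p]
        remaining = [p for p in remaining if anchor not in p]
    return groups


# Hypothetical patterns, as a pattern search might return them.
patterns = [
    {"patch:KB5035967"},
    {"patch:KB5035967", "model:Dell Inspiron 3511"},
    {"patch:KB5035967", "os:Windows 11"},
    {"app:ExampleApp 2.1"},
]

groups = group_by_anchor(patterns)
for anchor, related in groups.items():
    print(anchor, "->", related)
```

By construction, every pattern in a group contains its anchor, which matches the description of an anchor item as a feature present in all patterns of its group.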
Users can also see the list of devices impacted for each anchor item.
Figure 6: Click the SAMPLE EVENTS button in the Guided RCA results to see a list of impacted devices
The Guided RCA tool has already helped many Workspace ONE customers pinpoint issues. For example, a major customer in the telecommunications industry discovered many driver and hardware incompatibilities that had led to OS crash events for end users. A large software company used Guided RCA to confirm that a specific OS patch was causing slow boot times. Another customer in the software industry was able to determine, with statistical significance, which version of a Windows app was causing crashes, all without manually studying charts and graphs.
Looking Ahead
Guided RCA is a powerful tool in Workspace ONE Intelligence that makes it easier for IT administrators to troubleshoot widespread issues across an enterprise and find the root cause. By using advanced pattern mining and other machine learning techniques, this tool analyzes large amounts of telemetry data very efficiently and extracts patterns to guide and speed up the troubleshooting process. We are continually improving this tool and are working on adding new use cases as well as incorporating additional telemetry data. RCA is a hard problem, and we realize that a single tool cannot solve all types of problems. We are developing new tools that will be available in the RCA toolchest that can be used to troubleshoot a wide variety of IT issues in enterprises.
To learn more about how Workspace ONE optimizes Digital Employee Experience (DEX), check out these resources:
- Workspace ONE Experience Management Getting Started Guide
- Guided RCA Product Documentation
- Workspace ONE Digital Employee Experience (DEX) Solution