5 Ways to Use NLP for Smarter Incident Management During a Crisis
This is the second of a two-part series on this topic. Part 1 introduces Natural Language Processing (NLP) and how it could transform Incident Management. Part 2 covers five real-world use cases for applying this technique.
With business activities around the globe facing major disruption, keeping IT incident response teams focused on the work that keeps the lights on is harder than ever. However, these teams have a weapon against incident uncertainty, and it comes from IT's own emails, ticket logs, and other unstructured, human-entered data.
As noted in Part 1 of this series, there is a treasure trove of insights in the human-generated information entered into your Incidents. Let's look at a few use cases where the NLP engine can be applied to Incident data to inform and drive business actions, minimizing the disruptive impact incidents can have on business continuity during our current time of global crisis.
Use Case 1: Reduce the Volume of Incidents
Clusters of Incidents that are identified by the NLP engine can point to a recurring set of issues of a similar kind. These issues can then be eliminated through Process Improvement, Change or other appropriate steps.
One of our customers had a steady stream of Incidents related to login expiry and password reset across Applications. The nature of the Incidents was identified based on the keywords present in an Incident Cluster uncovered by the Numerify NLP engine. While for an individual Application, the volume of such Incidents was not very high, the cumulative count across Applications was contributing meaningfully to the total Incidents. The Process Manager had assumed that the availability of an SSO based self-service password reset utility would have removed the need for users to log Incidents of this nature. But the real issue was that the self-service utility was not easily accessible and thus unknown to a lot of users. Better accessibility coupled with an email campaign for increased awareness resulted in a drastic reduction in Incidents of such nature across Applications.
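The pattern in this example, where a cluster is small within each Application but large in total, can be checked with a short script. The following is a minimal sketch in Python, not the Numerify engine's method. It assumes each Incident already carries a cluster label from an NLP engine; the field names, thresholds, and sample data are all hypothetical.

```python
from collections import Counter, defaultdict

def cluster_volume_by_application(incidents):
    """Count Incidents per cluster, broken down by Application."""
    volumes = defaultdict(Counter)
    for inc in incidents:
        volumes[inc["cluster"]][inc["application"]] += 1
    return volumes

def cross_application_clusters(incidents, min_total=5, max_share=0.5):
    """Flag clusters with meaningful total volume that no single Application dominates.

    min_total and max_share are illustrative thresholds, not recommended values.
    Returns (cluster, total_incidents, application_count) tuples, largest first.
    """
    flagged = []
    for cluster, per_app in cluster_volume_by_application(incidents).items():
        total = sum(per_app.values())
        top_share = max(per_app.values()) / total
        if total >= min_total and top_share <= max_share:
            flagged.append((cluster, total, len(per_app)))
    return sorted(flagged, key=lambda row: -row[1])

# Made-up sample: 2 password-reset Incidents in each of 4 Applications,
# plus 6 printer-driver Incidents that all belong to one Application.
SAMPLE_INCIDENTS = (
    [{"application": app, "cluster": "password reset"}
     for app in ("CRM", "ERP", "HR Portal", "Wiki") for _ in range(2)]
    + [{"application": "ERP", "cluster": "printer driver"} for _ in range(6)]
)

if __name__ == "__main__":
    for cluster, total, apps in cross_application_clusters(SAMPLE_INCIDENTS):
        print(f"{cluster}: {total} Incidents across {apps} Applications")
```

With this sample, only the password reset cluster is flagged: each Application contributes just two Incidents, but the cluster totals eight across four Applications.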
Use Case 2: "Shift Left" Incident Resolution
When NLP identifies the probable root cause and next best action for an Incident cluster, that cluster becomes a good candidate for "Shift Left", i.e., resolution at lower Levels (or Tiers) of the Incident Assignment Group hierarchy. Resolving an Incident at L1 is not only cheaper (lower resource costs) but also faster, leading to a lower overall MTTR (Mean Time to Resolution).
Keywords in an Incident Cluster revealed configuration mismatches across various instances for key customer facing applications. Given the criticality of the applications, all Incidents associated with these Applications were escalated to L2 or L3 Assignment Groups. A Knowledge Base article covering the detailed configuration steps for these Applications was created and made available to the L1 team. Such Incidents then started getting resolved at L1, saving precious time for the L2/L3 Assignment Groups who could then focus on higher value activities such as Development or on P1 issues.
Use Case 3: Identify Emerging Incident Threats
Incident analysis tends to be heavily focused on volume metrics. ITSM leaders are always looking at metrics like Configuration Items (CIs) with the highest incident volume, assignees with the highest number of Open Incidents, and other high-volume metrics and trends. One area of analytics that doesn't always get due credit is monitoring for emerging threats: metrics or behaviors that may not be problematic at present but are showing signs of becoming so in the next few weeks or months.
Low-volume Incident clusters that are seeing a steady upward trend on a month-to-month or week-to-week basis fall into this emerging threat category. While the volume of Incidents in a Cluster may not be presently large, it may be heading in that direction and thus would warrant immediate action. Tracking Incident cluster volumes over time can surface these emerging threats, and certain cluster descriptions can likewise indicate an upcoming problem that may soon have a large impact.
A Numerify customer rolled out a new Enterprise communication app for its employees. The rollout was deemed successful because there were no immediate issues, but that was because the initial usage of the app was low. As more employees started using the app, many Incidents were logged related to this app. While a chunk of these Incidents were account creation or "how-to" related, the Incident clusters revealed a steady upward trend in performance-related issues. As it turned out, increasing active users was resulting in performance issues. Trending this Incident Cluster over a few months of data brought out this steady uptrend, which was then quickly addressed with a hardware configuration fix.
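One simple way to surface a cluster like this is to fit a trend line to its weekly Incident counts and flag clusters whose volume is still low but rising. The sketch below uses an ordinary least-squares slope. It illustrates the idea only; the thresholds, cluster names, and counts are hypothetical assumptions, not values from the customer example.

```python
def weekly_slope(counts):
    """Least-squares slope of Incident counts against week index (Incidents per week)."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

def emerging_clusters(weekly_counts, max_volume=20, min_slope=1.0):
    """Return clusters whose latest volume is still low but whose trend is upward.

    max_volume and min_slope are illustrative cutoffs that would need tuning.
    """
    flagged = {}
    for cluster, counts in weekly_counts.items():
        if not counts:
            continue
        slope = weekly_slope(counts)
        if counts[-1] <= max_volume and slope >= min_slope:
            flagged[cluster] = round(slope, 2)
    return flagged

# Made-up weekly counts for three clusters.
SAMPLE_WEEKLY_COUNTS = {
    "app performance": [1, 2, 4, 5, 7, 9],      # low volume, rising steadily
    "how-to questions": [6, 5, 6, 5, 6, 5],     # low volume, flat
    "vpn outage": [40, 42, 41, 43, 40, 44],     # high volume, already visible
}

if __name__ == "__main__":
    print(emerging_clusters(SAMPLE_WEEKLY_COUNTS))
```

Here only the app performance cluster is flagged. The high-volume cluster is excluded on purpose, since standard volume reports already cover it.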
Use Case 4: RCA (Root Cause Analysis) for Incidents
Almost all information on the Root Cause investigation for an individual Incident gets logged into text fields like Resolution Notes, Root Cause Description, etc. Thus, Incident clustering can be extremely impactful in the RCA process. Once RCA is done for one type of Incident, the same RCA can be extended to all other Incidents that belong to the same Cluster, enabling repeatability in the RCA process. This reduces the number of Incidents closed without RCA while at the same time enabling quicker corrective action for Incident resolution.
A Numerify customer operates a chain of retail stores across the US and manages a large deployment of point-of-sale machines. During a two-month period, there were a large number of Incidents related to transaction timeouts. Given the geographical spread of the business, the customer had set up separate support teams for every state/region. The analytics team at HQ identified this cluster of transaction-timeout Incidents from the Incident Clusters and extracted the RCA that a regional support team had documented. These Root Cause remediation steps were then standardized and broadcast to all support teams across the country to eliminate Incidents of this kind across all stores.
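Extending one documented RCA to the rest of a cluster can be sketched in a few lines of Python. This is an illustration under stated assumptions: Incidents are plain dictionaries with hypothetical `id`, `cluster`, and `rca` fields, and the RCA text in the sample is invented. The output is a list of suggestions for a human to review, not an automatic update.

```python
def propagate_rca(incidents):
    """Suggest an existing RCA for Incidents in the same cluster that lack one.

    Uses the first documented RCA found in each cluster and returns a
    {incident_id: suggested_rca} mapping.
    """
    rca_by_cluster = {}
    for inc in incidents:
        if inc.get("rca") and inc["cluster"] not in rca_by_cluster:
            rca_by_cluster[inc["cluster"]] = inc["rca"]

    suggestions = {}
    for inc in incidents:
        if not inc.get("rca") and inc["cluster"] in rca_by_cluster:
            suggestions[inc["id"]] = rca_by_cluster[inc["cluster"]]
    return suggestions

# Made-up Incidents: one regional team has documented an RCA, others have not.
SAMPLE_RCA_INCIDENTS = [
    {"id": "INC001", "cluster": "transaction timeout",
     "rca": "Hypothetical: gateway timeout set too low"},
    {"id": "INC002", "cluster": "transaction timeout", "rca": None},
    {"id": "INC003", "cluster": "transaction timeout", "rca": ""},
    {"id": "INC004", "cluster": "receipt printer", "rca": None},
]

if __name__ == "__main__":
    for incident_id, rca in propagate_rca(SAMPLE_RCA_INCIDENTS).items():
        print(f"{incident_id}: suggested RCA -> {rca}")
```

INC004 gets no suggestion because no Incident in its cluster has a documented RCA yet.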
Use Case 5: Effective Problem Management
Per ITSM processes, the main objective of Problem Resolution is to resolve and prevent Incidents of the same kind from occurring repeatedly. Such Problems generally get identified by the Problem Management process. In Use Case 4, we talked about leveraging Incident Clusters to identify such Problems and get them fixed.
Going beyond that, Incident Clustering can also be beneficial in understanding Problem rollout effectiveness. In an ideal scenario, all Incidents related to a Problem should be resolved once the Problem is fixed. Subsequently, no new Incidents related to the Problem should be created. Generally, IT teams struggle to measure this effectiveness.
Identifying trends in the Problem Cluster's Incident volume over a time period spanning the Problem resolution can help. An effective Problem rollout should see the Incident volume in the related clusters drop sharply towards zero after the fix is deployed. Any other trend essentially points towards an ineffective Problem fix.
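The trend check described above can be approximated by comparing a cluster's average weekly volume before the fix with its volume in the most recent weeks after it. The following Python sketch is one possible way to do this; the function name, the `tail_weeks` and `tolerance` parameters, and the sample counts are hypothetical.

```python
def fix_effectiveness(weekly_counts, fix_week, tail_weeks=3, tolerance=0.1):
    """Compare pre-fix Incident volume with the most recent post-fix weeks.

    weekly_counts: Incident counts per week for one Problem cluster.
    fix_week: index of the first week after the fix was deployed.
    The fix counts as effective when the average of the last `tail_weeks`
    weeks is at most `tolerance` times the pre-fix weekly average.
    """
    before = weekly_counts[:fix_week]
    after = weekly_counts[fix_week:]
    if not before or len(after) < tail_weeks:
        raise ValueError("need pre-fix data and at least tail_weeks of post-fix data")

    baseline = sum(before) / len(before)
    tail_avg = sum(after[-tail_weeks:]) / tail_weeks
    if baseline == 0:
        effective = tail_avg == 0
    else:
        effective = tail_avg / baseline <= tolerance
    return {"baseline": baseline, "tail_avg": tail_avg, "effective": effective}

# Made-up weekly counts; the fix is deployed at the start of week index 4.
EFFECTIVE_FIX = [30, 28, 32, 30, 12, 4, 1, 0, 0]
INEFFECTIVE_FIX = [30, 28, 32, 30, 25, 27, 29, 31, 28]

if __name__ == "__main__":
    print(fix_effectiveness(EFFECTIVE_FIX, fix_week=4))
    print(fix_effectiveness(INEFFECTIVE_FIX, fix_week=4))
```

Looking only at the last few weeks allows for a short wind-down period right after the fix, when already-affected users may still be logging Incidents.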
Better Incident Management for IT Departments
Taken together, these five applications of NLP can be instrumental in effective incident management. Firstly, you are reducing the overall volume of incidents and dealing with lower level ones by using the time and cost effective "shift left" method. Equally important is identifying which factors are likely to constitute an emerging threat so that you can manage it before an incident occurs.
If an incident does occur, the next steps are conducting a Root Cause Analysis (RCA) and, finally, using Incident clusters to check whether the resulting Problem fix was effective. While each of these applications is an excellent use of NLP to analyze incident clusters, utilizing all five provides for a wider range of contingencies.
Having resources like these during a moment of crisis gives IT teams added information, makes their response to incidents more robust, and helps them address threats to business continuity that may be flying under the radar. Put together, these capabilities help IT organizations manage risks and add just a bit of certainty to uncertain times.
Want to learn how you can leverage Natural Language Processing (NLP) techniques to make sense of the human-generated incident data in your organization? Watch Abhijeet Joshi, Senior Product Manager, and Joe Foley, Director of Business Analytics, to learn how organizations like yours:
- Achieve Smart Incident Management with AI-Powered Analytics
- Leverage NLP to help make sense of large volumes of unstructured text descriptions
- Identify related incidents using Incident Topic Clustering
- Benefit from using Topic Clustering for Smart Incident Management