This post is from the Numerify blog and has not been updated since the original publish date.
Using AI to shift from reactive to proactive major incident management
What Predicting Tornadoes and Major IT Incidents Have in Common
When the weather turns bad, the signs that there's a tornado on the horizon are always ominous, but they're rarely obvious.
"The sky turned the weirdest color of gray I'd ever seen," Mark Ausbrooks, survivor of a 2014 tornado in Mayflower, Arkansas, told NBC News. "You always hear how still it gets, and there was not a leaf moving."
An average person with no knowledge of meteorology can pick up on these abnormalities, but they will have difficulty processing this information into a clear message: "DANGER AHEAD! TAKE COVER."
In the same way, telltale signs of an impending major IT incident may be everywhere, but they will be ignored if they are not assembled together in a way that can indicate and anticipate risk.
What IT needs to predict, and possibly avert, these incidents is a system like the one the National Weather Service uses to predict and alert people to possible severe weather activity. These systems don't just look at one factor. Instead, they assemble a set of all known risk factors to put together an overall picture of risk probability.
Meteorologists know to look at geography, time of year, presence of thunderstorms, barometric pressure and trend, level of moisture at low to mid-level altitudes, and the presence of updrafts within their tornado prediction models. If these elements reach a certain range, they create favorable conditions for a tornado.
Authorities look at this risk analysis and determine whether to activate a watch or a warning and possible evacuation orders. Furthermore, the model enables authorities to localize the risk and target their preventive actions.
A new AI-backed system created by Numerify brings these same capabilities to IT organizations, allowing them to respond to impending disasters before they cause major disruptions and damage.
Introducing a New AI-Based System for Major Incident Risk Prediction
Today, we're officially launching our Major Incident Risk Prediction Engine to help organizations predict and prevent service disruptions using essentially the same principles the National Weather Service uses to predict tornadoes. Our engine assembles known major incident risk factors into a model that can indicate when conditions are favorable for a major incident. It can also predict the localized impact of possible incidents given these conditions and issue an appropriate advisory for risk mitigation.
This new capability is included in our Service Management Process Optimization solution. It provides IT executives with unprecedented visibility and actionable insights into their service management processes. It does so by incorporating the same proven principles we first introduced in the Change Risk Prediction solution, which has already delivered millions of dollars in savings to a range of organizations across industries.
As IT organizations evolve toward a fast-paced, DevOps-oriented model, a key challenge confronting them is the scale and complexity of incidents affecting IT services and infrastructure. Gartner estimates that the cost of downtime is well over $300,000 per hour. Furthermore, a research report by Quocirca suggests that duplicate and repeat incidents are a pervasive and persistent problem.
Most organizations employ a reactive approach to major incident management. The goal of such an approach is to restore business services as soon as possible, and it relies on reducing the Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR). A post-incident problem process is then used to identify and permanently remediate the root cause.
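To make the two metrics concrete, here is a minimal sketch of how MTTD and MTTR might be computed from incident records. The record fields (`started`, `detected`, `resolved`) and the timestamps are invented for illustration, and definitions vary between organizations: some measure MTTR from detection rather than from the start of the fault.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with invented timestamps.
incidents = [
    {"started": datetime(2020, 1, 6, 9, 0),
     "detected": datetime(2020, 1, 6, 9, 20),
     "resolved": datetime(2020, 1, 6, 11, 0)},
    {"started": datetime(2020, 1, 14, 22, 0),
     "detected": datetime(2020, 1, 14, 22, 40),
     "resolved": datetime(2020, 1, 15, 2, 0)},
]

def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )

# Assumption: both metrics are measured from the start of the fault.
mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "resolved")
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

Both numbers only exist after an incident has happened, which is why an approach built on them is reactive by construction.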
However, with this approach organizations bear the brunt of the negative consequences before they can begin their response. IT leaders increasingly recognize this limitation. Quocirca research suggests that 80% of organizations feel that their MTTD for incidents could be improved.
A proactive approach to major incident management has much more promise and leverages recent advances in Artificial Intelligence (AI) and Machine Learning (ML). The primary objective of this approach is the early detection of potential risk. It relies on identifying known risk factors for the organization based upon historical events using machine learning models. These models improve their predictive capabilities over time, drawing stronger correlations between the risk factors that have demonstrated the most predictive potential.
How AI and Machine Learning Models Can Predict Possible Major Incidents Before They Have an Impact
Organizations can leverage AI to monitor for the presence of problematic combinations of known risk factors. They then benefit from an early-alert system for major incident risk, allowing them to proactively recognize upcoming periods of high risk. This "early alert" puts them in a good position to minimize or eliminate risk and to address any incidents rapidly.
The benefits of a proactive incident management process are numerous and measurable. It can:
- Minimize the impact on business operations and customer experience
- Empower IT to deliver new capabilities on schedule
- Improve IT and business reputation for reliability
- Reduce overall service costs
Every proactive risk prediction model should have three core functions:
- Identify common risk factors leveraging machine learning or other advanced analytic techniques,
- Monitor for these risk conditions using an artificial intelligence model, and
- Visualize findings and notify key parties of the potential risk and predicted impact when a risk threshold is reached or high-risk events are predicted.
These functions are essential not just for identifying potential risks but also for putting IT teams in a position to act preemptively and address possible major incidents before they have a devastating impact.
A major incident risk prediction model accounts for various factors like:
- Past major incident volume
- Problem backlog
- Planned change activity
- Historical trend of time between major incidents
- Days since the last major incident
- Time of week and month
- Average problem age
- Minor incident growth rate
The model learns which attributes are the strongest indicators of major incident risk and can, therefore, indicate the level of risk as well as the drivers behind that risk level.
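As a rough illustration of what "learning the strongest indicators" means, the sketch below ranks two risk flags by how much the major incident rate differs when each flag is present versus absent. This is a deliberately simple stand-in for a trained machine learning model, not a description of Numerify's engine. The flag names and the history rows are invented.

```python
# Hypothetical history: one row per application-day, with binary risk flags
# and whether a major incident followed. All values are invented.
history = [
    {"high_problem_backlog": 1, "heavy_change_activity": 1, "major_incident": 1},
    {"high_problem_backlog": 1, "heavy_change_activity": 0, "major_incident": 1},
    {"high_problem_backlog": 1, "heavy_change_activity": 1, "major_incident": 0},
    {"high_problem_backlog": 0, "heavy_change_activity": 1, "major_incident": 0},
    {"high_problem_backlog": 0, "heavy_change_activity": 0, "major_incident": 0},
    {"high_problem_backlog": 0, "heavy_change_activity": 1, "major_incident": 1},
    {"high_problem_backlog": 0, "heavy_change_activity": 0, "major_incident": 0},
    {"high_problem_backlog": 0, "heavy_change_activity": 0, "major_incident": 0},
]

def incident_rate(rows):
    """Fraction of rows that were followed by a major incident."""
    return sum(r["major_incident"] for r in rows) / len(rows) if rows else 0.0

def rate_difference(rows, factor):
    """Incident rate when the factor is present minus the rate when absent."""
    present = [r for r in rows if r[factor] == 1]
    absent = [r for r in rows if r[factor] == 0]
    return incident_rate(present) - incident_rate(absent)

factors = ["high_problem_backlog", "heavy_change_activity"]
# Strongest indicator first.
ranked = sorted(factors, key=lambda f: rate_difference(history, f), reverse=True)
for name in ranked:
    print(f"{name}: {rate_difference(history, name):+.2f}")
```

A real model would weigh many factors jointly and be retrained as new incidents occur, but the output has the same shape: an ordering of risk drivers.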
For instance, the model might learn that the risk is heightened when minor incident volume rises 15% above the mid-term trend line. This AI-based analytical model monitors the risk factors for all applications on a daily basis and calculates a composite risk score for each application based on current conditions.
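The two ideas in this example, a trend check and a composite score, can be sketched as follows. The 30-day moving average is only an assumed stand-in for the "mid-term trend line", and the weights, factor names and scores are invented. In practice they would come from the trained model rather than being set by hand.

```python
from statistics import mean

def exceeds_trend(daily_counts, window=30, threshold=0.15):
    """True if the latest count is more than `threshold` above the mean of
    the preceding `window` days (an assumed proxy for the trend line)."""
    if len(daily_counts) < window + 1:
        return False  # not enough history to establish a trend
    baseline = mean(daily_counts[-(window + 1):-1])
    return baseline > 0 and daily_counts[-1] > baseline * (1 + threshold)

def composite_risk_score(factor_scores, weights):
    """Weighted average of per-factor scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in factor_scores)
    weighted = sum(factor_scores[name] * weights[name] for name in factor_scores)
    return weighted / total_weight

# Invented data for one application: 30 days at 20 minor incidents, then 24.
minor_incidents = [20] * 30 + [24]

# Hypothetical weights; a trained model would supply these.
weights = {"minor_incident_growth": 0.5, "problem_backlog": 0.3, "planned_changes": 0.2}
factor_scores = {
    "minor_incident_growth": 1.0 if exceeds_trend(minor_incidents) else 0.0,
    "problem_backlog": 0.6,
    "planned_changes": 0.5,
}

score = composite_risk_score(factor_scores, weights)
print(f"Composite risk score: {score:.2f}")
```

Running this once a day per application gives a score that can be tracked over time and compared across applications.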
Application owners can be notified when their application reaches favorable conditions for a major incident to occur. Then, they can drill into the specific risk factors driving up their composite risk score and take steps to understand and mitigate the risk.
By understanding the specific risk factors, the application support teams can investigate underlying issues that are increasing current risk. IT leadership may decide to issue a change freeze for the specific applications at risk until mitigation actions are taken.
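The notify-and-drill-down step might look like the sketch below: applications whose score crosses a threshold get an advisory listing the factors that contribute most. The threshold of 0.7, the weights, the application names and the factor scores are all assumptions chosen for illustration.

```python
def build_advisories(app_factors, weights, threshold=0.7, top_n=2):
    """Return one advisory per application whose weighted risk score meets
    the threshold, naming the factors that contribute most to the score."""
    advisories = []
    for app, factors in app_factors.items():
        # Each factor's weighted contribution to the composite score.
        contributions = {name: factors[name] * weights[name] for name in factors}
        score = sum(contributions.values()) / sum(weights[name] for name in factors)
        if score >= threshold:
            drivers = sorted(contributions, key=contributions.get, reverse=True)[:top_n]
            advisories.append(
                {"application": app, "score": round(score, 2), "drivers": drivers}
            )
    return advisories

# Hypothetical weights and per-application factor scores in [0, 1].
weights = {"minor_incident_growth": 0.5, "problem_backlog": 0.3, "planned_changes": 0.2}
app_factors = {
    "billing": {"minor_incident_growth": 0.9, "problem_backlog": 0.8, "planned_changes": 0.4},
    "intranet": {"minor_incident_growth": 0.2, "problem_backlog": 0.3, "planned_changes": 0.1},
}

advisories = build_advisories(app_factors, weights)
for advisory in advisories:
    print(advisory["application"], advisory["score"], advisory["drivers"])
```

An advisory like this gives the application owner a starting point for investigation, and gives IT leadership a concrete list of applications to consider for a change freeze.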
A Necessary, Proactive Approach to Preventing Disastrous Business Disruptions
The threats of both tornadoes and major IT incidents are too real to simply react to once they have already begun to wreak havoc on the ground. IT teams can and must be prepared in advance. Major incident risk prediction systems are the tools they need to protect elements of essential business value, rather than picking up the pieces of what remains after an incident has already blown through their organization.
IT operations systems and processes continually generate a rich set of data, but IT organizations often lack an analytical lens to convert it into actionable insights. IT leaders can leverage AI and ML models to proactively ensure business service stability. These models can analyze relevant data to identify patterns that highlight which applications are at risk when a foreboding combination of conditions occurs.
Major incident and change risk prediction models act as good entry points for most IT organizations to begin the adoption of AI and ML models to reduce risk and cost while delivering high-quality services to their business stakeholders.
Want to learn more? Watch our recent webinar explaining these systems: "How to improve major incident management using predictive analytics & AI"