This post is from the Numerify blog and has not been updated since the original publish date.
Mastering Change with Machine Learning
Change success is a critical concern for IT leaders because both the frequency of changes and the consequences of failed changes are greater than ever. As interactions with customers and business partners migrate from personal to digital, the business demands more rapid and frequent changes. Digitization is likewise raising the stakes for getting changes right, as the consequences of failed changes are painfully visible to customers and the marketplace. Add the fact that cybersecurity requires frequent and urgent patches, and it all amounts to tremendous pressure on IT leaders to improve change success.
What's Wrong with Today's Approach to Change?
Change managers face a problem that's very similar to the challenge advertising professionals have grappled with for more than a century. John Wanamaker, one of the kings of retailing in the late 1800s, summed it up in this famous quote: "Half the money I spend on advertising is wasted; the trouble is, I don't know which half." The Googles, Amazons, and Facebooks of the world, along with many others, have invested billions in machine learning and other forms of artificial intelligence to solve that very challenge. We're all familiar with ads popping up that magically reflect our online activity. They are all aiming to improve the percentage of advertising exposures that result in a purchase.
The corollary for change managers is this: I know that 10 percent of my changes fail, but I don't know which 10 percent until it's too late. Just as advertisers have to make difficult decisions about where to spend their scarce advertising dollars, change managers have to make difficult decisions about where to spend their limited time and budget. There are so many things you could potentially focus on to improve change success that it's very difficult to know where to start. Examples include working on change process and governance, training staff, analyzing the root cause of unsuccessful changes, improving testing, and implementing tools to automate change deployment, and the list goes on. Faced with a list of things you could do to reduce change failure that far exceeds your capacity, you do what you can with your limited time and budget, guided by experience and intuition. Typically, reductions in the change failure rate are minimal, if they are measured at all.
A Modern Machine Learning-Driven Approach to Change
A modern machine learning-driven approach to change success has four key interdependent elements:
- Automatically Uncover Key Risk Factors from Your Historical Change Data: We typically extract or derive more than 100 data attributes related to historical changes from modern service management systems. The beauty of machine learning algorithms is that they automate the task of sifting through so many variables to identify the few that are significantly correlated with change failure, either on their own or in combination. We typically see 10 or fewer significant variables.
- Predict Which Changes Are Most Likely to Fail: The machine learning model can then evaluate each change in the queue to determine the historical failure rate for changes with a similar profile, which gives you the predicted probability of failure for that change. This allows you to make smarter decisions on where to focus management attention, subject matter expertise, and governance to prevent high-risk changes from failing.
- Prevent Change Failure Across Your Organization: While it's great to be able to "rescue" high-risk changes one by one, it's much more effective to prevent change failure by removing risk factors or developing more effective risk mitigations at the organization level by making improvements to processes, people, and/or tools. To use a healthcare analogy, it's much better to prevent strokes by managing your patients' blood pressure with medication than to save them in the ER after they have the stroke.
- Monitor for New Threats to Change Success: The risk factors that threaten change success are constantly evolving as your IT processes, people, and technology evolve. Therefore, you need a way to constantly monitor your change data for emerging risk factors. The machine learning algorithms that we use to uncover the original risk factors can be rerun automatically to detect those emerging risk factors and give you early warning.
Automatically Uncover Key Risk Factors from Your Historical Change Data
While machine learning is best known for generating predictions, certain methods can also automatically uncover the key risk factors that drive adverse outcomes such as failed changes. Stepwise Logistic Regression and Decision Trees are two examples of machine learning methods that can uncover risk factors in addition to generating predictions. These algorithms can automatically search through dozens or hundreds of data attributes to find the select few that are significantly correlated with change failure, either individually or in combination. Here are a few examples of change risk factors uncovered:
- "Change implementers with fewer than 10 prior completed changes experienced failure rates three times higher than those with 10 or more completed changes when the changes required downtime."
- "Cloud Services changes failed at four times the rate of other changes."
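To make findings like "failed at four times the rate of other changes" concrete, here is a minimal sketch of how such a ratio could be computed from historical change records. It is a deliberate simplification, not stepwise logistic regression or a decision tree: it only ranks single attribute values by their failure rate relative to all other changes. The field names (`category`, `window`, `failed`), the `min_count` threshold, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


def failure_rate_ratios(changes, attributes, min_count=5):
    """Rank (attribute, value) pairs by how much more often changes with
    that value failed compared with all other changes."""
    total = len(changes)
    total_failed = sum(c["failed"] for c in changes)
    results = []
    for attr in attributes:
        counts = defaultdict(lambda: [0, 0])  # value -> [changes, failures]
        for c in changes:
            counts[c[attr]][0] += 1
            counts[c[attr]][1] += c["failed"]
        for value, (n, failed) in counts.items():
            rest_n = total - n
            rest_failed = total_failed - failed
            # Skip values with too little data for a meaningful ratio.
            if n < min_count or rest_n == 0 or rest_failed == 0:
                continue
            rate = failed / n
            rest_rate = rest_failed / rest_n
            results.append((attr, value, rate, rest_rate, rate / rest_rate))
    return sorted(results, key=lambda r: r[4], reverse=True)


def make(category, window, failed, n):
    """Build n identical synthetic change records (illustrative data only)."""
    return [{"category": category, "window": window, "failed": failed}
            for _ in range(n)]


history = (
    make("Cloud Services", "overnight", True, 2)
    + make("Cloud Services", "overnight", False, 3)
    + make("Cloud Services", "daytime", True, 2)
    + make("Cloud Services", "daytime", False, 3)
    + make("Database", "overnight", True, 2)
    + make("Database", "overnight", False, 18)
    + make("Database", "daytime", True, 2)
    + make("Database", "daytime", False, 18)
)

ranked = failure_rate_ratios(history, ["category", "window"])
for attr, value, rate, rest_rate, ratio in ranked:
    print(f"{attr}={value}: {rate:.0%} vs {rest_rate:.0%} ({ratio:.2f}x)")
```

A ranking like this looks at one attribute at a time. Finding risk factors that only appear in combination, such as inexperienced implementers on changes that require downtime, is where methods like decision trees earn their keep.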
Predict Which Changes Are Most Likely to Fail
Embedding change failure predictions into your operational change process ensures that they are available when and where change managers and implementers can make the best use of them. The figure below shows an example of how your existing change queue dashboard could be enhanced to display the failure probability and risk factors alongside your normal operational change data.
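As a rough illustration of the prediction step itself (not of the dashboard), the sketch below scores a queued change with the historical failure rate of changes that share its profile. A trained model such as logistic regression would generalize across profiles rather than look them up, so treat this as the simplest possible stand-in. The field names, the `min_similar` fallback rule, and the sample records are assumptions for illustration.

```python
def predict_failure_probability(history, queued_change, profile_attrs,
                                min_similar=5):
    """Return the failure rate of historical changes whose profile matches
    the queued change; fall back to the overall failure rate when fewer
    than min_similar historical changes match."""
    if not history:
        raise ValueError("history must not be empty")
    similar = [
        h for h in history
        if all(h[a] == queued_change[a] for a in profile_attrs)
    ]
    pool = similar if len(similar) >= min_similar else history
    return sum(h["failed"] for h in pool) / len(pool)


def make(ci_class, change_type, failed, n):
    """Build n identical synthetic change records (illustrative data only)."""
    return [{"ci_class": ci_class, "type": change_type, "failed": failed}
            for _ in range(n)]


history = (
    make("Oracle Server", "Normal", True, 3)
    + make("Oracle Server", "Normal", False, 7)
    + make("Web Server", "Normal", True, 1)
    + make("Web Server", "Normal", False, 19)
)

profile = ["ci_class", "type"]
oracle_change = {"ci_class": "Oracle Server", "type": "Normal"}
unseen_change = {"ci_class": "Mainframe", "type": "Emergency"}

p_oracle = predict_failure_probability(history, oracle_change, profile)
p_unseen = predict_failure_probability(history, unseen_change, profile)
print(f"Oracle Server change: {p_oracle:.0%}")
print(f"Unseen profile (overall rate): {p_unseen:.0%}")
```

Because nothing here depends on manual input, a job like this could rescore the whole queue daily, or as often as the change data is refreshed.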
Risk Level (Failure Probability)
The machine learning model generates a failure probability prediction for each change that has not yet been implemented. Because this is automated, you can update the predictions daily or as frequently as you want. This is just like a weather report informing you that there's a 30 percent chance of rain later today. We recommend converting the raw failure probability prediction to risk level buckets as shown in the figure, such as Low, Low-Medium, Medium-High, and High. The key reason for bucketing the change failure predictions is so you can provide your change teams with standard guidance on how to handle changes in each risk level. Many organizations also choose to feed the machine learning failure probability prediction into their standard risk assessment process as an additional input.
Change Risk Factors
Risk factors are particular values of change data attributes that the machine learning model finds to be most strongly correlated with change failure. Change teams will find it much easier to take action when they know which risk factors are driving the higher failure probability, so they can focus their risk mitigation efforts on those factors. Examples of risk factors illustrated in the figure below include:
- CI Class is "Oracle Server" or "Network Gear"
- Planned Start Time is in the 4 am or 5 am window
- Type is "Standard" and Impact is "Low"
Note that risk factors can be combinations of values for two or more data attributes.
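The bucketing and risk factor matching described above can be sketched in a few lines. The four bucket labels and the three risk factors come from the text; the probability cut-offs (5, 15, and 30 percent) and the field names (`ci_class`, `planned_start_hour`, `type`, `impact`) are hypothetical and would need to be set from your own data and guidance.

```python
# (upper bound on failure probability, label); the cut-offs are hypothetical.
RISK_LEVELS = [
    (0.05, "Low"),
    (0.15, "Low-Medium"),
    (0.30, "Medium-High"),
    (1.00, "High"),
]

# Each risk factor maps attribute names to the set of risky values.
# A factor with several attributes matches only if ALL of them match.
RISK_FACTORS = [
    {"ci_class": {"Oracle Server", "Network Gear"}},
    {"planned_start_hour": {4, 5}},
    {"type": {"Standard"}, "impact": {"Low"}},
]


def risk_level(probability):
    """Convert a raw failure probability into a risk level bucket."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, label in RISK_LEVELS:
        if probability <= upper:
            return label


def matching_risk_factors(change):
    """Return the risk factors whose attribute values all match the change."""
    return [
        factor for factor in RISK_FACTORS
        if all(change.get(attr) in values for attr, values in factor.items())
    ]


queued_change = {
    "ci_class": "Oracle Server",
    "planned_start_hour": 4,
    "type": "Normal",
    "impact": "Low",
}
level = risk_level(0.22)
factors = matching_risk_factors(queued_change)
print(level, factors)
```

In this example the third factor does not match, because the change has Impact "Low" but Type "Normal", and a combination factor requires every attribute to match.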
Prevent Change Failure Across Your Organization
Getting better at preventing individual changes from failing is great, but it's much better if you can prevent change failure by targeting the root causes at the organization level. To achieve that, you need to have clear visibility into the change risk factors and how they interact. A Change Failure Risk Management Dashboard like the example shown below provides the visibility that you need. In this example, each risk factor data attribute is represented with a heat map where the risk factors are shaded in pink or red, with deep red representing the highest change failure rates and larger boxes representing a greater proportion of the historical changes. The dashboard is interactive, allowing you to filter and click on one or more values within the heat maps to see how various combinations of risk factors impact the overall failure rate.
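The two numbers behind each heat map box, its size (share of historical changes) and its shade (failure rate), are simple to derive. The sketch below computes them for one attribute, with an optional filter that mimics clicking on a value in another heat map. The field names and sample data are assumptions for illustration, and this is not a description of any particular dashboard product.

```python
from collections import defaultdict


def heat_map_cells(changes, attribute, filters=None):
    """Return {value: (share_of_changes, failure_rate)} for one attribute,
    restricted to changes that match every attribute/value pair in filters."""
    filters = filters or {}
    selected = [
        c for c in changes
        if all(c[a] == v for a, v in filters.items())
    ]
    counts = defaultdict(lambda: [0, 0])  # value -> [changes, failures]
    for c in selected:
        counts[c[attribute]][0] += 1
        counts[c[attribute]][1] += c["failed"]
    return {
        value: (n / len(selected), failed / n)
        for value, (n, failed) in counts.items()
    }


def make(ci_class, impact, failed, n):
    """Build n identical synthetic change records (illustrative data only)."""
    return [{"ci_class": ci_class, "impact": impact, "failed": failed}
            for _ in range(n)]


changes = (
    make("Oracle Server", "Low", True, 2)
    + make("Oracle Server", "Low", False, 2)
    + make("Oracle Server", "High", True, 1)
    + make("Oracle Server", "High", False, 5)
    + make("Web Server", "Low", True, 1)
    + make("Web Server", "Low", False, 9)
    + make("Web Server", "High", True, 2)
    + make("Web Server", "High", False, 18)
)

all_cells = heat_map_cells(changes, "ci_class")
low_impact_cells = heat_map_cells(changes, "ci_class", {"impact": "Low"})
for value, (share, rate) in low_impact_cells.items():
    print(f"{value}: {share:.0%} of changes, {rate:.0%} failed")
```

Comparing the unfiltered and filtered results shows how a combination of values can carry a much higher failure rate than either value alone.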
Here are a few illustrative examples of actual change risk factors and the organization-level preventive actions they enabled.
Monitor for New Threats to Change Success
New risk factors will emerge as your people, processes, and technology rapidly evolve. These machine learning algorithms can be rerun automatically to detect new and changing risk factors. We recommend monitoring change risk factors monthly so you can promptly spot new threats to change success and respond proactively.
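One simple way to picture a monthly check is to compare each attribute value's failure rate in the most recent month against its rate in a baseline period and flag large increases. The sketch below does exactly that and is a simplification of rerunning the full model. The field names, the thresholds (`min_count`, `min_increase`), and the sample data, including the "Container Platform" category, are all assumptions for illustration.

```python
from collections import defaultdict


def rates_by_value(changes, attribute):
    """Return {value: (number_of_changes, failure_rate)} for one attribute."""
    counts = defaultdict(lambda: [0, 0])  # value -> [changes, failures]
    for c in changes:
        counts[c[attribute]][0] += 1
        counts[c[attribute]][1] += c["failed"]
    return {v: (n, failed / n) for v, (n, failed) in counts.items()}


def emerging_risk_factors(baseline, recent, attributes,
                          min_count=5, min_increase=0.10):
    """Flag attribute values whose recent failure rate exceeds their baseline
    rate by at least min_increase. Values not seen in the baseline are
    compared against the overall baseline failure rate."""
    if not baseline:
        raise ValueError("baseline must not be empty")
    overall = sum(c["failed"] for c in baseline) / len(baseline)
    flags = []
    for attr in attributes:
        base = rates_by_value(baseline, attr)
        for value, (n, rate) in rates_by_value(recent, attr).items():
            if n < min_count:
                continue  # too few recent changes to judge
            reference = base[value][1] if value in base else overall
            if rate - reference >= min_increase:
                flags.append((attr, value, reference, rate))
    return flags


def make(category, failed, n):
    """Build n identical synthetic change records (illustrative data only)."""
    return [{"category": category, "failed": failed} for _ in range(n)]


baseline = (
    make("Cloud Services", True, 2) + make("Cloud Services", False, 18)
    + make("Database", True, 2) + make("Database", False, 18)
)
last_month = (
    make("Cloud Services", True, 4) + make("Cloud Services", False, 6)
    + make("Database", True, 1) + make("Database", False, 9)
    + make("Container Platform", True, 2) + make("Container Platform", False, 3)
)

alerts = emerging_risk_factors(baseline, last_month, ["category"])
for attr, value, reference, rate in alerts:
    print(f"{attr}={value}: {reference:.0%} -> {rate:.0%}")
```

With small monthly samples, a check like this will produce false alarms, so the thresholds are worth tuning before anyone is asked to act on the output.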
Learn more about "Mastering Change with Machine Learning" in our exclusive webinar.
Numerify's Director of IT Business Analytics Services, Joe Foley, discusses how a machine-learning driven approach to change can prevent operational disruption, reduce the cost of recovery, and enable organizations to focus on growth and innovation.