This post is from the Numerify blog and has not been updated since the original publish date.
What Kinds of Data Should You Be Using to Reduce IT Operations Risk (Part 1)
Changes to the production environment can cause major incidents, disrupting critical business services. In fact, according to Gartner, 80% of major incidents are change-related. Organizations can proactively measure and respond to these potential risks using data analytics and AI models trained on their past track record. Sourcing the right data allows IT operations to track trends in ongoing service performance and anticipate emerging threats.
According to a white paper authored by Enterprise Management Associates (EMA) and commissioned by Numerify, "Continuous data collection is the foundation of any modern risk management strategy."
But how do organizations know which data is the "right" data to source? What data sources should they use to ensure maximum visibility into potential problems and the accuracy of risk predictions?
By linking IT analytics to the applications, systems, and services used daily in your organization, you can source rich data that can help you predict, and possibly even prevent, major disruptions to your most critical business services.
Identifying the Right Data to Collect
The most common challenges in getting an analytics solution to the point where it provides value involve deciding:
- How to identify the right data to collect
- How to collect the data
- How to intelligently manage the data
- How to access the data
Best practices for data sourcing will vary from organization to organization, depending on the activities and processes they use. The "right" source will also vary depending on the types of problems IT wants to anticipate and mitigate — your individual use case.
However, most IT operations looking to reduce production risks will want to use data that can tell a story about operations and customer experience. Analyzing this data to trace cause-and-effect patterns for certain changes makes it possible to detect when the conditions for a change-related major incident are present.
Common data sources that fit this bill include:
- End user-generated tickets
- Problem logs
- Alerts or auto-generated tickets from performance monitoring systems
- Configuration Management Database (CMDB) changes
- Metadata for unclosed tickets, reassignment activities, etc.
- Major incidents
- Change advisory board (CAB) logs/reports
IT teams may also want to look at data sourced from the development environment, since particular types of production changes can be traced back to development decisions. This data can reveal correlations between certain development activities and the issues they tend to cause later in production.
Unstructured data also provides contextual information and the potential to discover patterns that often go overlooked. For example, Natural Language Processing (NLP) can be applied to auto-link changes to incidents using factors like sequences of common words, CI or application names, and proximity in time.
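The source doesn't describe Numerify's actual implementation, but the auto-linking idea above can be sketched with a simple scoring function. This is a minimal, stdlib-only illustration: the field names, weights, and 24-hour window are assumptions, not the product's logic. It combines word overlap between descriptions, a shared Configuration Item (CI) name, and how soon after the change the incident was opened.

```python
from datetime import datetime, timedelta

def text_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def link_score(change: dict, incident: dict, window_hours: float = 24.0) -> float:
    """Score how likely an incident is linked to a change.

    Combines description overlap, a shared CI name, and time proximity
    (incidents opened soon after the change score higher).
    """
    sim = text_similarity(change["description"], incident["description"])
    same_ci = 1.0 if change["ci"] == incident["ci"] else 0.0
    delta = incident["opened_at"] - change["closed_at"]
    if delta < timedelta(0) or delta > timedelta(hours=window_hours):
        time_score = 0.0
    else:
        time_score = 1.0 - delta / timedelta(hours=window_hours)
    # Illustrative weights; in practice these would be tuned on labeled links.
    return 0.4 * sim + 0.3 * same_ci + 0.3 * time_score

# Hypothetical records for illustration
change = {"ci": "payments-db",
          "description": "schema migration on payments db",
          "closed_at": datetime(2020, 5, 1, 2, 0)}
incident = {"ci": "payments-db",
            "description": "payments db errors after migration",
            "opened_at": datetime(2020, 5, 1, 3, 30)}
score = link_score(change, incident)  # high score: same CI, shared words, 90 min apart
```

A real NLP pipeline would use richer text features (TF-IDF, embeddings) rather than raw word overlap, but the structure — several weak signals combined into one link score — is the same.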
Best Practices for Extracting IT Data Across Key Systems of Record
Identifying the data sources you need to use is one step; bringing them into your IT analytics solution is another.
Prebuilt application integration and adapter tools simplify the process of gathering data from common sources. Using a proven solution with prebuilt adapters enables automated data discovery, providing fuel for a continuous data pipeline that offers regular updates and a near real-time health report of your production environment.
Solutions that have prebuilt adapters for key applications can offer scalability, improve agility, and shorten the time it takes to extract value from a solution. Overall, this reduces the risk, expense, and project time needed to make data analytics functionally operational.
Using the Right Data to Fuel IT Business Analytics and Anticipate Risks
The resulting data can be visualized and analyzed in a dashboard tool, revealing the activities closest to the site of change-related problems and potential future risks. Compiling data in this way not only provides easy data access, it can also offer insights at a glance.
AI models can identify thresholds that flag metrics that have drifted into a range indicating unfavorable pre-conditions for a major incident. Users should be able to drill down into these highlighted metrics to see the more granular activities associated with them, such as whether certain operations teams are contributing a large amount of risk by carrying an extensive unclosed ticket backlog.
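A rough sketch of how such a drift threshold might work, using the unclosed ticket backlog as the example metric: flag the metric when its current value moves more than k standard deviations above its historical mean. The data and the choice of k are illustrative; a production model would learn thresholds from labeled incident history.

```python
from statistics import mean, stdev

def drifted(history: list, current: float, k: float = 2.0) -> bool:
    """Flag a metric whose current value has drifted beyond k standard
    deviations above its historical mean (an unfavorable pre-condition)."""
    return current > mean(history) + k * stdev(history)

# Hypothetical weekly counts of unclosed tickets for one operations team
backlog_history = [40, 42, 38, 41, 39, 43, 40]

drifted(backlog_history, 41)  # within the normal range: not flagged
drifted(backlog_history, 60)  # well above the band: flagged for drill-down
```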
These Machine Learning (ML) tools can discover relationships between incident and change activities that had previously gone unnoticed. For example, an ML engine can discover that changes related to a certain app function, like Extract, Transform, Load (ETL) from a plug-in, have a high tendency to cause incidents. Or, it could reveal which teams or even individuals make the riskiest changes.
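At its simplest, the ETL-style discovery described above amounts to comparing incident rates across change categories. This stdlib sketch uses hypothetical change records (the category names and outcomes are invented for illustration); a real ML engine would control for volume, confounders, and statistical significance rather than report raw rates.

```python
from collections import Counter

def risk_by_category(changes: list) -> dict:
    """Incident rate per change category: the fraction of changes in
    each category that were later linked to a major incident."""
    totals, incidents = Counter(), Counter()
    for category, caused_incident in changes:
        totals[category] += 1
        if caused_incident:
            incidents[category] += 1
    return {c: incidents[c] / totals[c] for c in totals}

# Hypothetical records: (change category, was it linked to an incident?)
changes = [("ETL plug-in", True), ("ETL plug-in", True), ("ETL plug-in", False),
           ("UI config", False), ("UI config", False), ("UI config", True)]
rates = risk_by_category(changes)
# Here the "ETL plug-in" category shows a higher incident rate than "UI config",
# which is the kind of pattern that would surface for CAB review.
```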
These capabilities improve decision-making, allowing CABs and IT operations to move from reacting in the face of costly, stressful incidents to proactive decision-making. Mitigation and avoidance become more common, reducing the frequency and impact of change-related problems overall.
All of these benefits have the effect of making organizations more efficient, agile, and resilient in the face of possible risks.
Learn more about how you can make changes less likely to cause incidents by reading our white paper authored by EMA: "Change Risk Best Practices: How to Significantly Reduce Change Risk in Production Environments"