Why Pass Rate Isn’t a Release Signal

It’s Thursday afternoon. Release is scheduled for Friday morning. 

You open the dashboard. 94% pass rate. The failing tests are ones you’ve seen before – intermittent, usually green on a rerun. You rerun them. Green. You sign off. 

Saturday morning, your phone buzzes. 

Users are reporting random session logouts. Engineering is in a war room. The root cause: a race condition in the auth token refresh. The same instability a flaky test had been flagging for four straight sprints – while the team learned to rerun until green and move on. 

The signal was there the whole time. You just weren’t reading it. 

The metric everyone tracks. The question it can’t answer. 

Pass rate is the most common metric in mobile QA. It’s also one of the least useful for making release decisions. 

A count of 847 passed and 23 failed tells leadership nothing about risk. It doesn’t reveal whether those 23 failures touch a critical payment flow or an obscure settings page. It doesn’t show whether failures are trending upward across releases or represent isolated regressions. [1] 

A 94% pass rate is not a release signal. It’s a count. And counts without context don’t answer the question that matters before every release: where is risk concentrated in this build? 

One failure in a payment confirmation flow is not the same risk as one failure in a rarely-used edge case. But pass rate treats them identically. [2] 

If that didn’t ring a bell, these will 

That opening story is one version of a pattern that plays out in different forms across almost every mobile team. The details change. The root cause doesn’t: the signal was there, and nobody had a system for reading it. 

Here are two more you’ve probably lived through. 

The failure nobody noticed because the number looked fine 

Sprint 6. A single test covering the payment confirmation screen starts failing. 1 failure out of 847 – a 99.9% pass rate. It looks like noise. It ships. 

Three days later, users report they’re not receiving order confirmation emails. A downstream integration broke silently, and the one test covering it was the one everyone skipped over because the overall number looked healthy. 

The problem wasn’t a missing test. It was that pass rate made a high-risk failure invisible by averaging it into a reassuring number. [1] 

The slow drift nobody flagged 

Over four sprints, pass rate moves 97% → 96% → 94% → 93%. Each individual drop looks like normal variance. No threshold, no trend line, no alert – just a number in a weekly report that nobody compares to last sprint’s number. 

By the time it hits 93%, there are 11 new recurring failures spread across three feature areas. None individually alarming. Together they signal something systemic is degrading. 

The dangerous version of this isn’t a sudden drop. It’s a slow drift that only becomes visible when you trend it over time instead of reading it as a snapshot. [4] 

What these scenarios have in common 

None of them are testing failures. The tests ran. The suite executed. The data was there. 

They’re all insight failures. The team had the signal and no system for reading it. 

What happens next is predictable: developers stop reading failure logs carefully. QA leads stop triaging every red build. Real failures get dismissed alongside the flaky ones. This isn’t negligence – it’s a rational response to a system that produces too much noise and too little signal. When every failure looks the same, teams stop looking closely at any of them. [3] 

The right question before every release 

Most pre-release conversations center on “did the tests pass?” The question that actually predicts release safety is different: where is risk concentrated in this build? 

Answering it requires context that pass rate doesn’t carry. 

Failure location, not just failure count. A failure in a critical user journey is categorically different from a failure in a rarely-used edge case. What matters is which failures, where they live, and whether any are sitting on paths real users will hit. [2] 
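To make the difference between failure count and failure location concrete, here is a minimal Python sketch. The area names, the `AREA_WEIGHT` values, and the result-record shape are all assumptions for illustration; a real team would derive weights from its own user journeys.

```python
from collections import defaultdict

# Hypothetical criticality weights per feature area (illustrative only).
AREA_WEIGHT = {"payments": 10, "auth": 8, "search": 3, "settings": 1}


def pass_rate(results):
    """Plain pass rate: passed tests divided by total tests."""
    passed = sum(1 for r in results if r["passed"])
    return passed / len(results)


def risk_by_area(results, weights=AREA_WEIGHT, default_weight=1):
    """Sum criticality weights of failing tests, grouped by feature area."""
    risk = defaultdict(int)
    for r in results:
        if not r["passed"]:
            risk[r["area"]] += weights.get(r["area"], default_weight)
    # Highest-risk area first.
    return sorted(risk.items(), key=lambda item: item[1], reverse=True)


def make_run(failing_areas, total=100):
    """Build a synthetic run: one failing test per listed area, the rest passing."""
    failures = [{"area": a, "passed": False} for a in failing_areas]
    passes = [{"area": "settings", "passed": True}] * (total - len(failures))
    return failures + passes


build_a = make_run(["settings", "settings"])  # two failures on a low-traffic page
build_b = make_run(["payments", "auth"])      # two failures on critical journeys

print(pass_rate(build_a), risk_by_area(build_a))
print(pass_rate(build_b), risk_by_area(build_b))
```

Both synthetic builds report the same pass rate, yet the weighted view ranks them very differently. The weights are the judgment call here; the point is only that the ranking has to come from somewhere other than the count.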

Trend over time, not just today’s snapshot. A 93% pass rate means something different if last sprint was 97% than if it’s been holding steady at 93% for six builds. Build-over-build stability is a far more reliable predictor of release readiness than any single run – but only if someone is actually looking at the trend, not just the number. [4] 
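A trend check does not need to be sophisticated to be useful. The sketch below, in Python, compares the latest pass rate with the start of a window and counts how many build-over-build steps went down. The function name and the 2-point `max_total_drop` threshold are assumptions, not a recommended standard.

```python
def drift_report(pass_rates, max_total_drop=2):
    """Summarize pass-rate movement across a window of builds.

    pass_rates: percentages, oldest first.
    max_total_drop: hypothetical alert threshold in percentage points.
    """
    total_drop = pass_rates[0] - pass_rates[-1]
    # Count how many consecutive build-to-build steps went down.
    declining_steps = sum(
        1 for prev, cur in zip(pass_rates, pass_rates[1:]) if cur < prev
    )
    return {
        "total_drop": total_drop,
        "declining_steps": declining_steps,
        "alert": total_drop > max_total_drop,
    }


# The four-sprint drift from the scenario above, versus a steady 93%.
drifting = drift_report([97, 96, 94, 93])
steady = drift_report([93, 93, 93, 93, 93, 93])

print(drifting)
print(steady)
```

Both series end at 93%, but only the first one raises the alert. Any single step in the drifting series would have stayed under the threshold on its own, which is why the comparison has to span the window.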

Flakiness as a first-class signal, not noise to suppress. Flaky tests erode trust in automation. When failures seem random, teams rerun pipelines, ignore signals, and slow releases. The teams that treat flakiness as structured data – classifying failures, trending instability by test area, asking why a test keeps alternating – are the teams that catch race conditions before they become production incidents. 
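One simple way to treat flakiness as data is to measure how often a test's outcome flips between consecutive runs. The Python sketch below does that; the test names, the histories, and the 0.3 `flaky_threshold` are made up for illustration and would need tuning against real run history.

```python
def flip_rate(history):
    """Fraction of consecutive runs where the outcome changed (pass <-> fail)."""
    if len(history) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return flips / (len(history) - 1)


def classify(history, flaky_threshold=0.3):
    """Label a test from its pass/fail history (True = pass), oldest run first."""
    if not history:
        return "no data"
    if flip_rate(history) >= flaky_threshold:
        return "flaky"
    return "passing" if history[-1] else "failing"


# Hypothetical run histories for three tests.
histories = {
    "auth_token_refresh": [True, False, True, True, False, True, False, True],
    "checkout_total": [True] * 8,
    "profile_upload": [True, True, True, True, False, False, False, False],
}

labels = {name: classify(runs) for name, runs in histories.items()}
print(labels)
```

A test that alternates gets its own label instead of being folded into "failed" or rerun into "passed". Trending that label by test area is what turns a rerun habit into a lead on something like a race condition.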

Failure clarity before shipping. A meaningful release decision requires knowing not just that tests failed, but whether those failures have a clear root cause, whether they’re new this build or recurring, and whether they sit on a path real users will hit. 
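Separating new failures from recurring ones is mostly set arithmetic. A minimal Python sketch, assuming each build's failures are available as a set of test names (the names below are hypothetical):

```python
def triage_failures(current_failures, previous_runs):
    """Split this build's failures into new and recurring.

    current_failures: set of failing test names in this build.
    previous_runs: list of sets of failing test names from earlier builds.
    """
    # Every test that failed at least once in the earlier builds.
    seen_before = set().union(*previous_runs)
    return {
        "new": sorted(current_failures - seen_before),
        "recurring": sorted(current_failures & seen_before),
    }


previous = [
    {"test_token_refresh"},
    {"test_token_refresh", "test_avatar_upload"},
]
current = {"test_token_refresh", "test_payment_confirmation"}

report = triage_failures(current, previous)
print(report)
```

Neither bucket is automatically safe. A new failure on a critical path needs a root cause before shipping, and a recurring one deserves the question of why it keeps coming back.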

Why mobile makes this harder 

On mobile, the signal-to-noise problem is more severe than on web. 

Device fragmentation, app store review delays, flaky networks, background execution limits, and OS-specific behavior mean quality problems often surface outside the lab. A green test run before submission is useful, but it’s not sufficient. [5] 

A suite that passes on your lab devices tells you almost nothing about what users on different hardware, different OS versions, and different network conditions will experience. Without overlaying failure patterns against the environment combinations that actually matter, pass rate doesn’t just obscure risk – it actively misrepresents it. 

The same test can pass on one device and fail on another with an identical build. Environment variability – OS state, locale, network conditions, device history – means identical test code produces different outcomes for reasons that have nothing to do with your application. Without visibility into what the environment was doing during a run, a green suite is not a guarantee. It’s a guess that happened to come out right. [6] 
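Overlaying failures against environment combinations can start as a simple group-by. The Python sketch below assumes each result record carries `device` and `os` fields; those field names and the placeholder values are assumptions, not the schema of any particular tool.

```python
from collections import defaultdict


def failure_rate_by_env(results):
    """Failure rate per (device, os) pair for a single build."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        env = (r["device"], r["os"])
        totals[env] += 1
        if not r["passed"]:
            failures[env] += 1
    return {env: failures[env] / totals[env] for env in totals}


# Same build, same tests, two placeholder environments.
results = (
    [{"device": "phone_a", "os": "os_14", "passed": True}] * 4
    + [{"device": "phone_b", "os": "os_13", "passed": True}] * 2
    + [{"device": "phone_b", "os": "os_13", "passed": False}] * 2
)

rates = failure_rate_by_env(results)
print(rates)
```

The overall pass rate for this synthetic build is 75%, which says nothing about the fact that every failure landed on one environment. Splitting by environment is what shows whether a failure follows the application or the device.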

What to demand from your test analytics 

The goal isn’t just automation. It’s high-confidence shipping – where every release passes through checks that are meaningful, trustworthy, and mapped to what users actually do on their devices. [7] 

Getting there means treating test data not as a pass/fail ledger, but as an analytics problem. It means building – and demanding – tools that surface failure concentration, stability trends, and flakiness patterns in a single view, before you make the call to ship. 

Most teams are one step away from that. The data exists. The signals are in the logs. What’s missing is a layer of analytics that turns execution output into release intelligence. 

The next generation of test analytics tools needs to be built around a single question: not “how did the suite perform?” but “where should I be worried about this release?” 

At Digital.ai, that’s the problem we’re working on. We don’t think the answer is more tests. We think it’s smarter signals – and the discipline to ask a harder question before every release. 

Digital.ai Testing gives mobile teams the execution infrastructure to run tests at scale. The analytics layer that turns that data into release confidence is what we’re building next. 

Sources 

  1. Virtuoso QA – What is a Test Report? 
  2. Testlio – Mobile QA Metrics 
  3. CloudBees – The Flaky Test Confession: Ignoring Test Failures 
  4. dev.to / Sophie Lane – Regression Testing Metrics That Actually Indicate Release Readiness 
  5. Capgo – App Quality Assurance 
  6. QA Wolf – Why Mobile E2E Tests Flake and How QA Wolf Controls Every Layer to Stop It 
  7. QA Wolf – Guide to Automated Mobile App E2E Regression Testing 
