This post is from the XebiaLabs blog and has not been updated since the original publish date.
How Many Test Failures Are Acceptable?
Continuous Delivery is getting a lot of mileage at the moment. It seems to be an idea whose time has come. There was a survey last year that claimed that 66% of companies had a "Strategy for Continuous Delivery". I am not sure that I believe that; nevertheless, it suggests that CD is "cool". I suppose that it is inevitable that such a popular, widespread idea will be misinterpreted in some places. Two such misinterpretations seem fairly common to me.
The first is that Continuous Delivery is really just about automating deployment of your software. If you have written some scripts or bought a tool to deploy your system you are doing Continuous Delivery - wrong!
The second is that automated testing is an optional part of the process: that getting your release cycle down to a month is a big step forward (which it is for some organisations) and that this means you are doing CD, despite the fact that your primary go-live testing is still manual - wrong again!
I see CD as a holistic process. Our aim is to minimise the gap between having an idea and getting working software into the hands of our users to express that idea so that we can learn from their experience. When I work on a project my aim is always to minimise that cycle-time. This has all sorts of implications, and affects pretty much every aspect of your development process, not to say your business strategy. Central to this is the need to automate, in order to reduce the cycle time.
The most crucial part of that automation, and the most valuable, is your testing. The aim of a CD process is to make software development more empirical. We want to carry out experiments that give us new understanding when they fail, and a higher level of confidence in our assumptions when they don't. The principal expression of these experiments is as automated tests.
The best projects that I have worked on have taken this approach very seriously. We tested every aspect of our system - every aspect! That is not to say that our testing was exhaustive, you can never test everything, but it was extensive.
So what does such a testing strategy look like?
The deployment pipeline is an automated version of your software release process. Its aim is to provide a channel to production that verifies our decision to release. Unfortunately we can never prove that our code is good; we can only prove that it is bad when a test fails. This is the idea of falsifiability, which we learn from science. I can never prove the theory that "All swans are white", but as soon as I see a black swan I know that the theory is wrong.
Karl Popper proposed the idea of falsifiability in his book "The Logic of Scientific Discovery" in 1934. Since then it has become pretty much the defining characteristic of science. If you can falsify a statement through experimental evidence it is a scientific theory; if you cannot, it is a guess.
So, back to software. Falsifiability should be a cornerstone of our testing strategy. We want tests that will definitively pass or fail, and when they fail we want that to mean that we should not release our system, because we now know that it has a problem.
I am sometimes asked the question, "What percentage of tests do you think should be passing before we release?". I think that people think that I am an optimistic fool when I answer "100%". What is the point of having tests that tell us that our software is not good enough, and then ignoring what they tell us?
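The "100%" rule can be expressed as a very small release gate. The following is only a sketch under stated assumptions: the `Outcome` record and the `release_decision` function are hypothetical names invented for illustration, not part of any particular CI tool, and treating an empty test run as "not releasable" is my own assumption rather than something the rule above spells out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of one automated test result."""
    name: str
    passed: bool


def release_decision(results):
    """Return (releasable, failed_names).

    A single failing test blocks the release. An empty result list
    also blocks it (assumption: no evidence means no release).
    """
    failed = [r.name for r in results if not r.passed]
    releasable = len(results) > 0 and not failed
    return releasable, failed


if __name__ == "__main__":
    run = [Outcome("login", True), Outcome("checkout", False)]
    ok, failed = release_decision(run)
    print("release" if ok else "blocked by: " + ", ".join(failed))
```

Note that there is deliberately no threshold parameter: the gate has nothing to tune, so there is nothing to argue about when a test goes red.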
In the real world this is difficult for some kinds of tests in some kinds of system. There have been times when I have relaxed this absolute rule. However, there are only two reasons why tests may be failing and it still makes sense to release:
1) The tests are correctly failing and showing a problem, but this is a problem that we are prepared to live with in production.
2) The tests or system under test (SUT) are flaky (non-deterministic) and so we don't really know what state we are in.
In my experience, maybe surprisingly, the second case is the more common. This is a pretty serious problem because we don't really know what is going on now.
Tests that we accept as "Oh that one is always failing" are subversive. First they acclimatise us to accepting a failing status as normal.
It is vital to any Continuous Integration process, let alone a Continuous Delivery process, that we optimise to keep the code in a releasable state. Fixing any failing test should take precedence over any other work. Sometimes this is expensive! Sometimes we have a nasty intermittent test that is extremely hard to figure out. Nevertheless, it must be figured out. The intermittency is telling us something very important. Either our test is flaky, or the SUT is flaky. Either one is bad, and you won't know which it is until you have found the problem and fixed it.
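One cheap first step when chasing an intermittent test is to confirm that it really is intermittent, by running it repeatedly and comparing the outcomes. The sketch below assumes a test is simply a zero-argument Python callable that raises `AssertionError` on failure; the `classify` function and its labels are hypothetical, made up for this illustration.

```python
def classify(test, runs=20):
    """Run `test` several times and label the observed behaviour.

    Returns "stable-pass", "stable-fail", or "flaky" (mixed outcomes).
    """
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if outcomes == {"pass"}:
        return "stable-pass"
    if outcomes == {"fail"}:
        return "stable-fail"
    return "flaky"


if __name__ == "__main__":
    shared = {"calls": 0}

    def depends_on_shared_state():
        # Passes only on even-numbered calls: a stand-in for a test
        # that leaks state between runs.
        shared["calls"] += 1
        assert shared["calls"] % 2 == 0

    print(classify(depends_on_shared_state))
```

In keeping with falsifiability, a "stable-pass" result only means that no flakiness was observed in those runs; it does not prove the test is deterministic. A "flaky" result, on the other hand, is definitive, although it still does not say whether the test or the SUT is at fault.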
If you have a flaky system, with flaky tests and lots of bugs in production, this may sound hard to achieve, but this is a self-fulfilling approach. To get your tests to be deterministic, your code needs to be deterministic. If you do this your bug count will fall!
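A common source of non-determinism is code that silently reads the wall clock. As a minimal sketch of what "making the code deterministic" can look like, the hypothetical `is_expired` function below accepts the current time as an optional argument, so a test can pin it to a fixed value. The function name and the scenario are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone


def is_expired(expires_at, now=None):
    """Return True if `expires_at` has been reached.

    `now` is injectable so that tests can supply a fixed instant;
    production callers omit it and get the real UTC clock.
    """
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= expires_at


if __name__ == "__main__":
    fixed_now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
    deadline = fixed_now + timedelta(minutes=5)
    # Same inputs, same answer, on every run and on every machine.
    print(is_expired(deadline, now=fixed_now))
    print(is_expired(deadline, now=deadline))
```

The same idea applies to random seeds, thread scheduling, and external services: anything the code cannot control should be something the test can.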
I recently read a good article on the adoption of Continuous Delivery at PaddyPower (http://www.infoq.com/articles/cd-benefits-challenges), in which the author, Lianping Chen, claims: "Product quality has improved significantly. The number of open bugs for the applications has decreased by more than 90 percent." This may sound surprising if you have not seen what Continuous Delivery looks like when you take it seriously, but it is completely in line with my experience. This kind of effect only happens when you start being aggressive in your denial of failure - a single test failure must mean "Not good enough!"
So take a hard line with your automated tests, test everything and ensure that a single failure means that your system is not fit to release.