Embracing Downtime: Why 99.999...% Availability is Not Always Better
A couple of weeks ago, my ever-active colleagues Marco Mulder and Serge Beaumont organised an nlscrum meetup about "Combining Scrum and Operations", with presentations by Jeroen Bekaert and devopsdays organiser Patrick Debois. Unfortunately, I was late and only managed to catch the tail end of Patrick's well-delivered talk explaining how Dev/ops can become Devops. Thankfully, the lively open space discussions that followed provided plenty of interesting insights, comments and general food for thought.
One recurring theme that particularly struck me was the comment, uttered with regret by many in Operations, that they would very much like to help and coordinate with the development teams but were inevitably too busy keeping the production environment up and running. In other words, helping prepare for new releases might be desirable, but achieving the five nines, or whatever SLA Operations has committed to [1], will always be paramount.
This is a fallacy! Indeed, one of the core realisations of the "Devops mindset", to me, is that 99.999...% uptime is not an end in itself, but a means to an end: delivering the greatest business value possible. And aiming for the highest possible availability may not be the best way to go about it! [2]
For instance, imagine a day's downtime in production costs $500k, and you have a new feature coming up for release that is estimated to bring in an extra $1m per day. Then every day by which you speed up the release earns an extra $1m, which means you can afford almost two days of downtime for each day gained! [3]
The point is: the ability to maintain a stable current environment cannot be considered independently of the ability to rapidly deliver change. Rather, they need to be balanced against each other to determine which combination is likely to deliver the greatest value. This is a decision only the business owner or customer can make. And naturally, the balance needs to be continuously monitored and updated in light of new requirements and experience.
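To make the arithmetic behind the downtime example above concrete, here is a minimal Python sketch. The function names and the 365-day year are assumptions chosen purely for illustration, and the dollar figures are the made-up ones from the example, not real data.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def break_even_downtime_days(feature_value_per_day: float,
                             downtime_cost_per_day: float,
                             days_released_earlier: float) -> float:
    """Days of downtime whose cost equals the value gained by releasing earlier."""
    if downtime_cost_per_day <= 0:
        raise ValueError("downtime cost must be positive")
    gained = feature_value_per_day * days_released_earlier
    return gained / downtime_cost_per_day


if __name__ == "__main__":
    # Five nines leaves a budget of only a few minutes per year.
    print(f"{allowed_downtime_minutes(0.99999):.2f} minutes/year")
    # Illustrative figures from the example: $1m/day feature, $500k/day downtime.
    print(break_even_downtime_days(1_000_000, 500_000, 1), "days")
```

With the figures from the example, the break-even point is two days of downtime per day gained, so anything short of that still comes out ahead. A real decision would of course weigh far more than these two numbers.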
There is a residual belief that the tasks and responsibilities of developers and Operations are sufficiently different that they can't possibly benefit from each other's input. But whether it's the effects of placing nodes of a distributed system in different segments of the production network, or how the sharding and replication strategies of the database affect query performance, or even just knowing which version (and vendor!) of the JVM and container will be supported in production when the application goes live [4] - developers need Operations' input, and the earlier, the better. And only developers can add the internal health checks, debugging and tracing information, integration points for monitoring tools etc. that can mean the difference between a five-minute fix and a week's frustrated log trawling for the support team. It's revealing to see how quickly this crucial, yet often neglected, feature of an application is improved if developers are also responsible for support - generally, the first callout at three in the morning makes a world of difference. [5]
It goes without saying that the acceptable balance between stability and change will differ from customer to customer, and from application to application. Globally shared infrastructure can cause problems here, because it's hard to meet the requirements of the most demanding application without forcing all the others to pay the price. In other words, modularity is an important goal architecturally, and if you're interacting with shared infrastructure it should be tunable to your requirements. Amazon's Dynamo and, indeed, most of the cloud and distributed platforms out there exemplify this trend. But I'd like to defer a detailed discussion of the technical implications to a later blog [6]. My colleague Robert van Loghem and I will also be talking about this and related topics in our upcoming webinar (plug!).
Going back to the nlscrum meetup, the takeaway message for me was clear: setting up two independent entities, Development and Operations, giving them opposing goals (delivering change on the one hand, ensuring stability on the other) and expecting them to fight it out when the inevitable conflict happens is not the way to best deliver business value. We should be looking to organise our teams and activities to deliver the balance between new features and running systems that is most appropriate for a given application. And we can only do that if we first go to the customer, explain that there is a trade-off to be made and work together to make it!
Addendum: in the unexpectedly long time it's taken me to finish off this post, my colleague Gero Vermaas described a client scenario that featured a real-life version of this challenge. It's good to see the client finally came round to accepting the concept, hopefully with the expected positive results!
Footnotes
[1] Too often without drawing on actual day-to-day experience, a point made by Patrick.
[2] Of course, rushing inadequately tested, unstable software out just to release a feature on a certain date usually isn't a good way to go about it, either. This post is not supposed to be "Ops-bashing"; it's just that reducing the "feature frequency" is far less controversial, in most organisations, than even considering reduced stability.
[3] The relative magnitude of the two figures is not particularly realistic, for sure. It's just for example's sake.
[4] Don't laugh! I've seen it happen too often, to clever and experienced developers, to believe this is only an isolated problem.
[5] Quite a few big companies are adopting this model for all their applications. A number of attendees at the nlscrum meeting also reported positive experiences with this approach.
[6] Or even "blog series", who knows.