John Allspaw joins us this week to talk about incident response, and helping organizations build their own NTSB (the National Transportation Safety Board, a US government agency that investigates transportation accidents).
John gives us an overview of what he and the other folks at Adaptive Capacity Labs are working on.
John talks about the state of the industry around incident response. Learning from incidents is happening; but are organizations supporting it? Are people finding it helpful? Expertise is coming from inside the house, in that software practitioners are getting better at coping with the complexity of the systems. Where there is still work to do is around how teams learn from their incidents and postmortems. Are the artifacts generated by these exercises used after they are created, or do they just become a museum to the incidents?
There is an emerging community of folks who are really enthusiastic about learning from other organizations’ incidents, in software and across different kinds of industries. But many incident reports are still written to be filed rather than written to be read, and leave out some of the other aspects of an incident that are important. In addition to just whatever triggered the incident, there are other aspects to be learned from, like weighing potential fixes, or finding information.
Thinking about incident reports as a story, what elements make it a good story? What did the team struggle with? What was hard about it?
A number of “safety critical” domains have a longer history than software development with respect to dealing with incidents. Some domains have different constraints, challenges for gathering data, legal ramifications. What will the future look like, and can software development avoid some of those constraints.
How do all of these potential incidents impact not just the employees on the teams managing the incidents, but also the public, consumers, and how are they impacted by an outage? Are consumers able to make informed decisions about a company based on how incidents are handled?
As a learning exercise, do your new employees take the opportunity to read past incident reports and then ask questions?
John deflects answering about Root Cause. You’ll just have to check Twitter.
An existing belief that leads people to a potentially incorrect outcome, is that an incident is seen by different people, with different perspectives, as different. There is not one true universal story that everyone will get from reading an incident analysis.
The goal shouldn’t be for just the person doing the analysis to understand what happened, but to also make future readers understand what happened.
John talks about the practice of software engineering, and the certainty that things has changed and will change. Everything should be up for questioning of assumptions about what is the best way for something to be done.
John talks a bit about incident command frameworks and refers to Laura Maguire’s research on the costs of coordination and how the costs associated with robust incident response are easy to forget. Laura’s talk is linked in the Additional Resources below.
John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the forward to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement.
John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.
Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.