Building an Incident Response Plan With John Allspaw

Posted on Wednesday, Aug 19, 2020

Mandi Walls talks with John Allspaw, Co-Founder and Principal at Adaptive Capacity Labs, about the practice of dealing with technical incidents.

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @lnxchk on Twitter. Today, I’m joined by John Allspaw, a co-founder and principal at Adaptive Capacity Labs. You might know John from the famous Flickr talk from the Velocity Conference years ago, if you’re not totally new to DevOps, but he’s working on some great stuff now. So, welcome to the show, John. You want to fill us in on what kind of stuff you’re doing these days?

John Allspaw: Thanks for having me, Mandi. I’m excited to talk about some of these things. A thing that we’ve been doing with Adaptive Capacity Labs, we’ve been doing [inaudible 00:00:56] oh, about, going on three years now, is pretty simple, actually, simple to say, not super simple to do, which is helping organizations make progress learning from incidents. And we do that a bunch of different ways, but for the most part, the part that’s most rewarding for us is to help coach and train groups on, one way of saying it is sort of like having their own internal NTSB. Analyzing incidents, we have a strong stance that analyzing incidents goes much further, it goes much beyond, to do it effectively than your standard template. So, that’s what we’ve been doing and it’s been a lot of fun.

Mandi Walls: Cool. What kind of things are you finding when you’re working with folks? What’s the state of the industry right now? Do folks already have a practice in place that you’re improving, or are you helping folks get started from zero?

John Allspaw: So, the short answer is sort of equal parts exciting and worrisome.

Mandi Walls: Oh, dear.

John Allspaw: We’ve seen a number of organizations, all the way from smaller, I wouldn’t say early, early stage, but certainly smaller established startups, all the way to tens of thousands of staff on multiple continents types of organizations. We see all sorts of things in the details, but as far as patterns are concerned, I’d just say that it’s slightly worrisome in that learning from incidents is happening, as my colleague Richard Cook would like to say, is you can’t actually get people to not learn. The question isn’t whether the learning’s happening, it definitely is, the question is whether you’re supporting it. And what people learn is always an open question. Sometimes what people learn is going to post-mortem meetings is a huge waste of time and it’s a “check-the-box” sort of chore. And so it’s exciting mostly because we’re hoping there’s a, I would say, a perspective shift. What we like to say is, “The expertise is coming from inside the house.” We’re much better in the software engineering, and I say that in the broadest sense, industry, the community of practitioners. We’re much better at coping with these increasingly complex systems that we design. The issue is that how we understand them, how we, what we’ve said is sort of recalibrating our understandings as incidents arise, it’s not well-supported. It’s not well-captured. It happens almost tacitly. It’s definitely happening, but it’s pretty sort of narrow and localized. And so we find a lot of organizations run through what the typical or sort of conventional approach is to a post-mortem. They do the things that organizations have been writing about for a while now. But what we find is it’s captured in a way that makes it pretty difficult for anybody who wasn’t there and involved with the incident, or at the very least familiar with some of the esoteric details about the technology that’s involved, it’s pretty difficult to get an understanding of the real substance of it. 
And we know this is because engineers don’t go seeking out reading your typical post-mortem report. We always ask potential clients, “How do you do post-incident activities?” And they say, “Oh, well, we do this, and we have a post-mortem meeting.” And then we say, “Well, do you write these up somewhere?” “Oh, yes, of course. You never want a good crisis to go to waste,” and all sorts of sort of cliched, banal sayings. And they say, “Oh, yes. And we keep that in our wiki,” or Confluence or Google Docs or whatever. And then we always ask, “Who reads them?” And thus far, we don’t get many answers other than, “Oh, that’s a good question. I’m not entirely sure,” which leads you to the understanding of, well, if no one’s reading them, who’s learning?

Mandi Walls: Yeah. Do you see that, too? Of course, there’s a group of us nerds who will read anybody’s post-mortem when they put it out just for our own edification. But do you see many of those folks out there? Any kind of horror story, we love to read some of that stuff when it gets posted, but…

John Allspaw: Yeah. Back to sort of exciting and worrisome, the one thing that is exciting is that there’s a pretty fast-growing community looking to understand how to do this more effectively and borrowing, and really enthusiastic, diving headfirst into understanding how different industries do this work, what those challenges are, what opportunities we have in software versus, say, aviation or medicine or nuclear power. And it is, it’s growing quickly, the Learning from Incidents sort of community started by Nora Jones, I think sometime last year, the learningfromincidents.io site, this is a quickly growing group and that’s really exciting. One of the most significant differences, and it kind of explains why these things are more often than not written to be filed versus written to be read, is that for the most part they’re sort of “follow the template” [inaudible 00:06:15] Mad Libs way. You’ve got a handful of sort of freeform text fields that you can sort of fill in. But it doesn’t capture what makes the incident difficult. And without that information, we want to know about red herrings that people followed, we want to know about what was surprising, what was surprising to some people but not others. We want to know what was difficult in handling the… Was it understanding what was happening that was difficult? Maybe it was pretty straightforward to understand what was happening or easy to get some confidence about what was happening, but more difficult to do something about it, or more difficult to weigh between a couple of options, all of which have some potential downsides if you were to follow them, that sort of thing. And this is what makes for stories. This is what makes for stories. When we talk with organizations, we almost always want to talk straight with engineers, we don’t really waste too much time on talking to technology leaders. And I say that as a former CTO. No. They are distanced, right?

Mandi Walls: Sure. Absolutely.

John Allspaw: They’re distanced from the hands-on work. And when we talk with engineers, we ask them pretty open-ended and say, “Tell us about an incident. Tell us about an incident that comes to mind.” We don’t give them much more prompt than that or a criteria. And what comes to mind are good stories. What makes a good story? Well, you can tell, even if this engineer doesn’t exactly identify as being a storyteller or a good writer or anything like that, they’ll tell the story because it makes for a good story. And what the elements of a good story is to include struggle and difficulty. It’s as straightforward as people remember. Things stick out for people when elements and qualities of the narrative involve descriptions of what people found difficult. And other engineers are, and Mandi, you and I have known each other for a long time, there’s how many sort of late nights at the Velocity Conference around the bar are people telling these types of stories to each other?

Mandi Walls: Absolutely. I haven’t been on-call in 10 years and I still have anxiety dreams about some of the incidents that happened when I [inaudible 00:08:36].

John Allspaw: Yeah. And then, again, to sort of say the perhaps obvious thing is that if you don’t remember something, how can you say that you’ve learned it? But then, of course, that’s the difficulty, and this is effectively the business model of Adaptive Capacity Labs, which is you have a finite amount of time to produce a thing that you want to have the greatest depth and the greatest accessibility or the broadest audience. And so that’s where learning these skills are, which involves interviewing and analysis of what people have written, and what people have said before, what past incidents that have connections to the incident you’re looking at, all of those sorts of things.

Mandi Walls: Absolutely. So, you mentioned NTSB, aviation, medicine, nuclear power, these industries that have a much longer history than sort of what we’re looking at right now in software engineering. What are we learning? There’s a lot higher stakes. And I’ve read some of the human error books and some of the other things. And don’t read them when you’re on an airplane, that’s trauma-inducing for everyone around you. But what are we pulling in? What are you looking at to borrow from some of these other places? And your partners all have expertise in some of these areas that they’re sort of bringing to bear as well. So, what are we bringing in?

John Allspaw: So, my answer to this question is, I’d say different than it was, oh, even about five years ago. So, you’re absolutely right. Starting with the fact that these other quote unquote “safety-critical” domains have a much longer history, it also means that they’ve adapted and certainly responded to really notable accidents in the past in ways that sometimes are productive and sometimes aren’t productive. And so at a high level, understanding how, for example, reading how nurses detect issues in a neonatal intensive care unit, where the patients actually don’t speak English because they’re babies. They can’t tell you, you can’t ask them what’s going on with them and “How are you feeling?” and that sort of thing. And so stories in that environment makes up a huge part of Gary Klein’s career, who studies naturalistic decision-making under pressure. And so these are all cognitive work problems. They’re all people attempting to cope with uncertainty and ambiguity and complexity when there are no easy answers. So, from a high level, there’s a boatload. And of course, you have to simultaneously read about these things, but at the same time, you’re making connections in the software world. The things that we’ve got an opportunity to sidestep, there’s some barriers to learning from incidents in those other domains that, hopefully, we can sidestep. For example, accident investigation in some of these other domains isn’t exactly as straightforward as you think in the way that getting data, getting information, from people who were there can be difficult. There are some domains where you’ll say, “Oh, can I talk to you about this incident? Can I talk to you about this accident?” and that sort of thing and the practitioner, whether it’s a doctor or a power plant operator, might, and quite often do, respond with, “Oh, sure. 
Let me just make sure that I have legal counsel,” or “Let me make sure that I have my union rep with me.” And so the good news is that software doesn’t have that sort of onerous regulation situation outside of sort of legal frames. But as J. Paul Reed said when he started REdeploy, one of the reasons why he wanted to, and I’m in firm support of, talking about resilience and talking about these topics, at least via the conference he created, was, look, there’s going to be a future where it’s possible that decisions like, well, you need to have some sort of “cover your ass” protections in place before you’ll say, “Oh, yeah. Here’s how I saw this outage go down.” And the question is, and what he said at that first REdeploy, was, are we going to, we in the industry, the tech industry, want to describe a future that we want and we think is productive or other people will decide for us. And so, yeah, that’s just, I could go on and on. There’s differences and similarities, it just requires translating and bridging. And if there’s one thing that I would want listeners to think of, is it’s not sort of a wholesale taking of a practice and just blindly applying it.

Mandi Walls: Sure. Yeah. You can’t [inaudible 00:13:31] all that stuff in. And part of it, too, thinking about what happens in public versus we have a plane crash or a nuclear meltdown, or some of these big high-stakes, very public, very broad featured things that are going to end up in the news for weeks at a time as they sort of trickle out. And then even some of the biggest security breaches and things like that, they’re out of the news cycle fairly rapidly. And sort of the average person I feel like doesn’t have, the consumer of the news, basically, isn’t as invested in what’s going on with a security breach and doesn’t have the personal connection to it the way they do, say, a crash [inaudible 00:14:19] some other major disaster that way. And as things go on and the rest of our lives become more and more integrated with computation and everything else that’s going on in my super computer in my pocket and all that stuff, do you see people changing the way they react to these things? I feel like now things are kind of blase, and unless you’ve had your identity stolen people are kind of sloppy about where they put their data, and the impacts of some of these other outages and things that happen aren’t part of their day-to-day discourse.

John Allspaw: I think I agree. It’s a good question. I spend most of my time not really thinking about sort of the broader audience [inaudible 00:15:03] consumers, but more about even people who have an opportunity inside an organization, inside of Twitter, inside of Facebook, inside of Etsy, inside of the real sort of messy guts of these services, and how it’s quite difficult to ask somebody about an incident two weeks ago. And so the people who are closest don’t have very much insight to begin with. So, how could we imagine consumers being able to make informed choices from an outsider perspective? The one thing that this rings a bell for me is how quickly those of us in the tech industry will read or, and take, if you’ve read a blog post about an incident that happened at a different company, a company that’s not yours, there are certainly some organizations that are better at doing that than others. And some of the stories make a real compelling reading. I would imagine the outage Cloudflare had last year that the post they wrote surrounding the sometimes really surprising behaviors that a regex can produce, I’m sure it made excellent reading. And it made for excellent reading because I don’t know of an engineer who wouldn’t be able to see themselves in the words of [inaudible 00:16:36]. Now, but we have to remember, that’s not an analysis. When you see a blog post, it’s written for a very particular audience. There’s information and data that, even if they did include, wouldn’t make sense-

Mandi Walls: Sure. Absolutely.

John Allspaw: For us. And so there’s a whole bunch of detail and jargon. And so there’s sort of an outside perception, and not even driven by any sort of motivation to deceive or misdirect, although that absolutely does happen. Learning from an incident in the broadest way would mean that, necessarily, you would think that the internal, the writeup about an incident, has richer detail than the outward-facing one. And to assume that they need to be, or even could be, symmetrically identical is a thing that we sometimes, maybe a “grass is always greener” sort of situation, to assume that, “Oh, yeah. That’s what happened.” Well, I’m not entirely sure that all of the details are included there that are significant to the case. Sorry, I went off on a little bit of a tangent there.

Mandi Walls: No [inaudible 00:17:50] I’m sitting here thinking, okay, so, I’m pretty new to PagerDuty and I just went through our onboarding and a bunch of training and I’m like, “Yeah, bring your engineers in and here’s this opportunity for them to see how the organization functions.” You read back through all the recent incidents and figure out how things get learned, and there’s a lot of rich material there for your folks to learn from.

John Allspaw: Yeah. That’s what you would hope, right? And we would say that, especially in the case of being new to an organization, this is a huge opportunity. And one of the things that we say with the companies that we work with [inaudible 00:18:31] at least starts out as a thought exercise, and amazingly, some have done pretty well sort of [inaudible 00:18:37] outside of a thought exercise, which is, can you take an incident, any incident, blindly reach into wherever you keep those things, hand it to Mandi, I’m using you as an example, and have her read it and then ask her what questions she has after. And the quality of those questions, are your questions as basic as, “Who wrote this?” “How many people were involved?” If there’s parts of the writeup that you don’t understand, that’s a place to sort of put effort. It’s an iterative process. If you have good questions, well, you’re probably… If you’ve got good questions like the people who were there, like, “Oh, that is a really good question. I don’t know,” then you’re probably in some good shape. If you’re asking some basic questions, probably got some room for improvement.

Mandi Walls: Yeah. Cool. All right. So, coming up to the last couple of minutes here, while we have some time, one of our sort of recurring little features is to debunk a myth. So, there’s plenty of things that are maybe misconceptions or myths or things around incident, incident response, incident learning, what’s a common myth or misconception that you might want to debunk? [inaudible 00:19:58] discussion about root cause or something like that that causes big Twitter threads when it gets out.

John Allspaw: Yeah, of course it does. So, I knew you were going to ask this question, and I was thinking about that. I’m going to just maybe sidestep the myth about root cause because, frankly, I’m a little, I guess, somewhat exhausted from talking about it. Here’s, I think, a belief, it’s certainly a myth, but maybe kind of an implicit belief. And that is that you can take the story, an analysis, of these complex events that we call incidents and distill or sort of reduce it to kind of like pulling juice out of an orange with a syringe [inaudible 00:20:57] concentrated lessons. And here are the lessons, here are the lessons learned. And then that belief is pretty poor because different people will find new or different understandings depending on what their experience is, depending on where they sit in the organization. So, there’s not a one size fits all. We’ve extracted these quote unquote “lessons learned,” because when you believe that, then it seems as if the problem that you have or the challenge that you have is dissemination. That’s not what we see. What we see, and that’s not how learning works, different people learn different things in different ways and different times. If I were to give a really well-done incident analysis to 10 different engineers, some of which are close to the tech involved and some that are far more distant, I would want, expect, multiple people to find multiple things of interest. Surprising insights, that sort of thing. Sometimes one group would say, “Wait, you didn’t know that? I thought that everybody knew that.” And so that’s the myth, that somehow there can be the one true canonical story and that everybody will equally get everything out of it that the author intends. That is, I think, is one of the sort of more devious, or that’s sort of maybe an implicit belief that people have.

Mandi Walls: Yeah. It’s very, very tempting to go searching for that when some of the things happen.

John Allspaw: Yeah. Yeah. The last things that I would say on this is that for incident analysis to be effective, the job is not for the person analyzing the incident to understand the incident, just themselves. The job is to understand how others understood the incident. If you think about incident analysis as discovering what people will find interesting to read about later, you’re in better shape than going to look for the grounded objective truth, which doesn’t exist to begin with.

Mandi Walls: Awesome. Okay. Very cool. All right. Two other sort of recurring questions, things we kind of ask every guest. We get folks that are more experienced and less experienced, but you have a long history working in the field. So, what’s one thing you wish you had known sooner when you sort of embarked on all of this? Maybe not necessarily just running software in production, but all of the things that you’re now sort of involved with. You’ve come from operations, you were a CTO for a while, and now you’re doing this kind of work. What do you wish you had sort of known earlier, maybe in your career or earlier in this path, to incident analysis?

John Allspaw: So, one thing comes to mind. And that is a thing that I wish that I had kind of known, or at least if I would want my future self to come back and say, “Now, pay attention to this,” is that previously in the industry, solutions that have previously been arrived at, let’s say software development, in that case, you could say waterfall, let’s take that as an example. We can be really confident that the way to do a thing, whether it’s software development or incident analysis or cooperating between application developers and infrastructure and ops engineers, should always be up for critical thinking. Just because it’s been done before and has been demonstrated to be somewhat successful doesn’t mean, “Oh, yeah. Yeah. We already know how to do that.” And it doesn’t mean that it shouldn’t require people to think about how it could be improved. And early in my career, I thought, “Oh, that’s how software gets developed. Okay. I just sort of learn that and then, okay, great.” Well, actually, no. It turns out that there’s been, between then and now, a huge amount, somewhat of an explosion of practices and beliefs that say, “Actually, you know what? There is a thing that’s beyond what we took for granted as ’that’s the way, that’s the way,’” because the elders said that was how we do things around here and they didn’t really question it.

Mandi Walls: Oh, absolutely. All right. And one other question. Is there anything else about running software production, incident analysis, that you’re glad we didn’t ask you about?

John Allspaw: Yes. I’m really glad that you didn’t ask me about incident response frameworks like incident command and ICS and IMS. The reason why I’m glad you didn’t ask me is because, were you to ask me, I would punt on the question and point to a work by an amazing cognitive systems engineer named Laura Maguire, who did her PhD dissertation on what’s known as costs of coordination. There’s a talk that she gave that sort of highlights what her findings were, but the teaser that I give people is that ICS does not, and ICS-like or ICS sort of based frameworks, don’t come for free, and the costs associated with them might not be readily apparent but they are real. So, I try to turn the question into a bit of a cliffhanger to get people to want to go learn more.

Mandi Walls: Awesome. That’s what we want people to do. Absolutely. Maybe we’ll have to get in touch with her and see if she wants to come on the show sometime.

John Allspaw: Yes, you do.

Mandi Walls: That would be super awesome. All right. Well, I think that’s all we have time for today. Thank you so much, John, for joining us. This has been wonderful.

John Allspaw: Mandi, I’m so grateful you asked. Thank you very much.

Mandi Walls: Awesome. All right. Thanks, everybody, for tuning in. This is Mandi Walls and I’m wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter @pageit2thelimit, using the number two. That’s page-it-to-the-limit. Let us know what you think of the show. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Show Notes

John Allspaw joins us this week to talk about incident response, and helping organizations build their own NTSB (the National Transportation Safety Board, a US government agency that investigates transportation accidents).

Introduction

John gives us an overview of what he and the other folks at Adaptive Capacity Labs are working on.

State of the Industry

John talks about the state of the industry around incident response. Learning from incidents is happening; but are organizations supporting it? Are people finding it helpful? Expertise is coming from inside the house, in that software practitioners are getting better at coping with the complexity of the systems. Where there is still work to do is around how teams learn from their incidents and postmortems. Are the artifacts generated by these exercises used after they are created, or do they just become a museum to the incidents?

Who are the Incident Nerds?

There is an emerging community of folks who are really enthusiastic about learning from other organizations’ incidents, in software and across different kinds of industries. But many incident reports are still written to be filed rather than written to be read, and leave out some of the other aspects of an incident that are important. In addition to just whatever triggered the incident, there are other aspects to be learned from, like weighing potential fixes, or finding information.

Thinking about incident reports as a story, what elements make it a good story? What did the team struggle with? What was hard about it?

What are We Learning from Other Industries?

A number of “safety-critical” domains have a longer history than software development with respect to dealing with incidents. Some domains face different constraints, such as challenges in gathering data and legal ramifications. What will the future look like, and can software development avoid some of those constraints?

How do all of these potential incidents impact not just the employees on the teams managing the incidents, but also the public, consumers, and how are they impacted by an outage? Are consumers able to make informed decisions about a company based on how incidents are handled?

As a learning exercise, do your new employees take the opportunity to read past incident reports and then ask questions?

Debunking a Myth

John deflects answering about Root Cause. You’ll just have to check Twitter.

A common implicit belief that leads people astray is that an incident analysis can be distilled into a single set of “lessons learned.” In practice, different people, with different perspectives, see the same incident differently. There is not one true universal story that everyone will get from reading an incident analysis.

The goal shouldn’t be just for the person doing the analysis to understand what happened, but to understand how others understood the incident and to capture what future readers will find interesting to read about.

What Do You Wish You’d Known Earlier in Your Career?

John talks about the practice of software engineering, and the certainty that things have changed and will change. Every assumption about the best way to do something should be up for questioning.

What Are You Glad We Didn’t Ask?

John talks a bit about incident command frameworks and refers to Laura Maguire’s research on the costs of coordination and how the costs associated with robust incident response are easy to forget. Laura’s talk is linked in the Additional Resources below.

Additional Resources

Guests

John Allspaw

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation” helped start the DevOps movement.

John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Hosts

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.