Post-Incident Reviews With Jaime Woo & Emil Stolarsky

Posted on Tuesday, Jun 16, 2020
Jaime Woo and Emil Stolarsky are co-founders of Incident Labs and curators of The Post-Incident Review zine. They talk with Scott about the importance of learning from mistakes, what makes a good post-incident report, and more importantly how to get people to read them.

Transcript

Scott McAllister: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Scott McAllister, @STMcAllister on Twitter. Today we’re going to talk about postmortems. After incidents, it’s important to take a moment and talk about what you’re doing right, where you can improve, and most importantly, how to avoid making the same mistakes again and again. Well-designed postmortems allow your teams to iteratively improve your infrastructure and incident response process. The postmortem concept is well-known in the technology industry, but it can be difficult for newer individuals, teams, and organizations to adopt the cultural nuances required for effective postmortems. We’re joined today by Jaime Woo and Emil Stolarsky from Incident Labs. They are also the authors and curators of The Post-incident Review, a zine about incident response that they describe as, “A love letter to the community.” Jaime and Emil, welcome to the show.

Jaime Woo: Thank you so much, Scott.

Emil Stolarsky: Thanks for having us.

Scott McAllister: Sure. So to get us started, tell us a little bit about yourselves, about the Post-incident Review and what it is, why you got it started, that kind of stuff.

Jaime Woo: Yeah, Emil and I met when we were both working at Shopify. And what made our friendship was just that we have this passionate enthusiasm for nerding out about stuff. And so let’s forewarn your listeners that they’re going to get a taste of that today. We definitely, when we get heated and excited about something, we start talking a million words a minute, and the hands go flailing, and we just, we get really, really deep into it. And one of the things that we really love talking about is after something goes wrong. Or even if something goes right, how do you learn the lessons from that? And it’s something that we’ve talked about since we first met, but especially when we started our own business, when we started Incident Labs. This is kind of the thing that we’re most interested in now. Everyone wants to solve a problem, that’s why you do a startup, right? You want to fix a problem. For us, it’s really like, how do you get better at something? How do you get the feedback to be able to do the things that you need? And what we realized partly was that we wanted to work on this stuff, but we also wanted to read about this stuff. And there just wasn’t as much stuff out there as we wanted. And one day Emil has this climbing accident book, because we both love to rock climb as well. And it just kind of lit a light bulb above our heads. Emil, actually you can talk more about the climbing accidents book.

Emil Stolarsky: Yeah. Both Jaime and I love rock climbing. And in the climbing community, there are a few guides about how to climb and there’s some organizations, accredited organizations. But a lot of it is about storytelling, and sharing, and lessons learned. Like, hey, this accident happened at this crag, which is a climbing location. This is what happened, don’t make these mistakes. And the American Alpine Association, we can add it to the show notes the actual name of the organization of the book, every year they release a climbing accidents of 2017. And what it is, is it’s a small book that’s just a collection of climbing accidents across North America and their descriptions. And it’s really fascinating, looking at this book. Because, when I think we think of post-incidents within companies, or postmortems, or whatever you refer to it as internally, you always think of this very large document. You think of who was involved, the timelines, descriptions, stakeholders, et cetera, et cetera. With the book, it ran the whole range. So if it was a report from park rangers, they could go into very detailed, minute-by-minute, exactly what happened. And then if it was a self-submitted report, it could be something, it would be like one paragraph, two people were climbing, a rock fell loose and they fell to the ground. And that’s, one paragraph is that description. And there was a couple of things coming from this. The booklet also, beside these sort of descriptions of different climbing accidents, would have also almost some best practices. Hey, when dealing with this type of rock, these are sort of some of the considerations to think about. Or, when you have this class of accidents, these are some of the first aid skills you should be aware of. And it really sort of helped, I think, catalyze a conversation in the climbing community of increasing that culture of always talking about different accidents that happened. So that’s one interesting thing.
The other really interesting thing was this range of quality in incident reports. It didn’t matter, every incident didn’t have to be sort of perfect. It could just be, just talk about it. Just talk about the accident that happened to you. That’s, telling that story is good enough. And I think this was kind of the catalyst for us with the Post-Incident Review, is rather than trying to really raise the bar and try to say all post-incident reviews have to be perfect, what if we just took the ones that were interesting to us, regardless of their length or like thoroughness, and brought them together into this physical, handheld piece.

Jaime Woo: Yeah. You might think it’s a bit grim. Because what would happen is, as Emil is reading this book, he’d be like, “Oh my God, look at this accident.” Or, “Holy crap, this thing happened to this person.” And you’re like, this feels weird to be so excited about all these bad things that happened. But I think it’s really having that learning underneath. It’s like, hey, someone already had to learn this lesson. If you don’t share that across the community, then really, there’s a possibility that someone else will fall into this same trap. And that’s the real thing that became an insight for us. It was like, hey, if you’re always waiting for it to be perfect, or to be fully detailed enough so that you can make sure that it is the right thing, there’s so many lessons that then aren’t going to be learned about, and we don’t become comfortable sharing this kind of stuff. And so realizing that, it could be interesting being printed, that these stories themselves did not have to be exhaustive. And that there must be people who care about post-incident reports the same way we do. Why can’t we do something like this? And so that really inspired us to go ahead and just say, well, let’s print it. Let’s see what happens when it’s in print. What happens when you take it outside of just on a website. Because you’re always reading them only online. What happens if you actually have it physically, in your hand? And people loved it. I think people have been waiting for something like this.

Emil Stolarsky: And then it also gave us the opportunity to just nerd out hard. And what would it look like to take … Because all the postmortems we generally see are in Google Docs. And what does it look like to put it into paper? But if, this is the sort of … Sure, you can just print it, but that’s not the fun part. The fun part was Jaime and I going to zine shops in Toronto, or different art shops and looking like, what does like an engineering journal look like? What does a different zine look like? What does the paper … We specifically, I can’t remember the kind of paper, but it wasn’t normal printer paper for our first issue. Because that weight, the weight of the paper just changed the context of the incident. You were like, oh, I’m going to be quiet in this room as I read this. I’m going to dim the lights a little bit because it’s like you’re elevating it. It was just interesting seeing that the feelings changed towards it completely.

Scott McAllister: That’s so funny. Because I could relate to that same feeling of the difference in format to music, right? All music today is pretty much available digitally or even streaming, but I collect vinyl. And so when I get an LP and I hold it in my hand and I pull it out of the sleeve and I see this big jacket and it’s like, oh, this is a different experience. Even though it’s the exact same song, it’s still just a different experience. That’s fantastic. We have a tradition on our show to ask our guests to debunk a myth. So what are some myths or misconceptions that you would want to debunk about postmortems?

Jaime Woo: Oh, I think the biggest one will be this idea that if you write it, then people will read it because it’s good for them. I think this is a hallmark thing that happens, is that people go, well, I’m right. I put a lot of work into this, there’s a lot of good lessons. So when I put it out there, obviously people will take the time to read it. They’ll find it, they’ll read it, digest it, they’ll apply the lessons. Ah, perfect, publish. Hit publish, and then that’s it. And that’s just not the way it works. We’re all so busy, we’re all so overwhelmed. Even if something is good for us, we don’t really have enough time. Or it’s fairly difficult to find stuff now because there’s so many competing things going on. And so part of what, actually, our very first issue we talked about was, you can’t just write the post-mortem. You really have to think about, how is the post-mortem going to be used? How are you going to make it discoverable? How are you going to guide people towards applying the principles of it? Because you’re already spending so much time working on the post-mortem, that if you are not going to spend that little bit of extra effort thinking about how it’s going to be used after, if it just sits in a drawer, then did you really need to spend that effort? And that’s a really tough thing, I think, for people to hear. Because they’re like, but I did all this good work. Why aren’t people responding to it? And it’s not personal. It’s not because they don’t want to read it, or they find it raw or boring. It’s just, if you think about it, we’re all so busy now, how do you get that in front of someone so they can actually take the time to pay attention to it? And that’s why when we printed the zine, all of a sudden we’re saying, hey, we put in a lot of effort, now just carry this with you. When you have time, it’s right here for you, we made it easy for you to read, and to absorb, and to enjoy. So why not? And then people did.

Emil Stolarsky: To Jaime’s point, there’s two sides to it. The one is ensuring that you have a culture of going and reading the posts and the reports. And so it’s, are they accessible to the whole company? Do you incentivize? So within the zines we have some incidents, and then we always have an article that we write about something related to the topic, and maybe a fun story. And in the first issue we wrote about creating, what’s the word I’m looking for, reading groups around the reviews. And how do you encourage that culture? And then on the flip side, you also want to be writing reports that are actually fun to read. If you’re writing them like a boring manual, you’re not going to be excited to read them, and then you’re not going to be gaining any learning value from them. A lot of it comes from the sharing of those stories. And so it’s important to be able to get both sides of that coin.

Scott McAllister: Post-incident report book clubs. I like that, I like that. That’s good, because then it gives you a reason. It gives you, you know you’re getting together with that group so you have to be prepared when you get to that group. That’s fantastic. So before embarking on this zine, what were your goals as you went into it?

Jaime Woo: I mean, we didn’t really go in with any expectations, really. This was kind of a passion project. Because, like I said, we were just so excited to talk about this stuff. And frankly, we just wanted more people to talk about this stuff. And we wanted more post-incident reports, right? As we were looking for them, we recognized that companies would be a little bit timid, sometimes, putting out stuff. It’s really vulnerable to say that an error happened, and then to pick apart exactly why it happened. And even if afterwards you say, well, this is how we’re going to do better, or this is how we’ve mitigated it and won’t happen again. It’s such a human thing. When we admit vulnerability and fault, we’re worried that that’s going to change the way people see us. Unfortunately, we don’t live in a culture where admitting mistakes or errors actually makes us look stronger. We clearly have a society where admitting any kind of vulnerability is seen as a weakness. And instead, we’re just supposed to pretend like we’re perfect, but we’re not. And you can’t learn if you pretend to be perfect. You have to learn. I mean, anyone who’s tried to learn piano, or learned to rock climb, or learned to cook. If you are trying to be perfect from the get-go, you won’t be able to ever get good at what you’re trying to do. And so for us, we loved learning about this stuff. And so we were just hoping, maybe if we put this out as some kind of positive intention to the world, maybe people will feel more comfortable wanting to do more of these. Because if we care about it and we can say, look, this is okay. It’s actually something to be celebrated. Does that change the perception of it? So we printed 200 copies on a whim of the first issue. We brought them to a conference. It was so heavy because each one, the first issue was … Because we were doing this only three times a year, each one weighed a pound each. And so we’re carrying this through…

Emil Stolarsky: Spread across three suitcases.

Scott McAllister: Wow.

Emil Stolarsky: Hoping they don’t get flagged through security as we’re going through the airport.

Jaime Woo: Yeah. Just, it was so heavy. I mean, we had to figure out what clothes not to bring because we were bringing 200 copies of the zine. On a hope that, hey, maybe someone will like them. And it felt so much like, I don’t know if anyone ever tried to run for, like, student council in grade school, but when we’re handing some of them out in the beginning, it felt so much like, please vote for us. Please read this thing that we love. But the reception was so lovely. Everyone who got a copy … We started having people come find us. Being like, “I heard you have a zine and I really, really want a copy of it.” And people would read it and go, I think it just kind of struck a chord with them. Because we’re all in this field because we care about our systems, and we want them to improve, and we all want to be learning. But what was so sad was that it was either just some technical document to be put away, or it was something that was just posted online and then you just kind of ignore it. And here it was like, no, actually this is something worth paying attention to. So we’ve been really happy about that.

Emil Stolarsky: And then some of the sheer joy of seeing some of the people at the conference, we put their incident into the first issue. And seeing their sheer joy of seeing their own blood, sweat, and tears on paper like that was the best feeling ever. Because they were probably like, okay, we’re going to publish this blog post. No one’s really going to look at it. And we were like, no, no, we’re going to like put this together into these zines and carry them across the ocean and hand it to you. It was just so much fun and excitement.

Scott McAllister: I can only imagine. I had similar experiences, in a past life I was a sports reporter. I would write for websites. And that was cool, it was super cool to still see my articles on websites. But then I got a job to write a story for an actual paper. And when I saw that in print, it was life-changing. It was like, whoa, that’s cool. So yeah, I can totally relate to that. From 200 copies at a conference, are you still producing and distributing the physical copies, or do you have online versions? How’s your reception been since then?

Jaime Woo: I mean, we always knew that there needed to be an online version, just because not everyone would necessarily be able to get to a physical version. And also, we wanted to lean into our zine roots, where you can actually download a version that you print out yourself and fold into a zine at home. Because I think what we wanted to also show was that this is not something that should feel too out of reach for people. It shouldn’t feel like, okay, well now this is only published in a magazine form, and then that’s it. It’s like, actually this is about just having the paper issue in your hand, and being able to read it and think about it. And the big thing there is, like you were saying about seeing your stuff in print. It’s because you have to dedicate yourself to reading that print issue, that’s why it’s so special, right? Right now I have 42 tabs open. So if I’m reading your post-mortem, I have it beside maybe a recipe of something I want to make, plus three articles I’ve been meaning to read, plus my, maybe whatever else I have open. Of course, then, everything just starts to blur. But when you’ve got a printed copy in front of you, you really sit down and say, okay, I’m actually going to focus my time on it. We’ve switched to making it actually a monthly edition. Because now we realize, before we’re printing them a little bit bigger so that they could be at conferences, or they could be, if we’re going to ship them, then it would be kind of substantive. But we realize now, it’s better to have small appetizers, I guess, every month of an incident report, than it is to try to wait every three months. And so that’s been the biggest change for us. We’re excited for the point where we can, I mean, we’re happy to print, right? If anyone wants to, if any company wants to step up and help us print these and send these out, we still love the format. We think that’s the most important thing. But that, yeah, that’s the big thing.
We’d love to get these out, but it is pretty pricey, right? To have to print them individually, send them out and all that kind of stuff. But we really do this for the community. And so we would love to be able to keep printing and sending them out. Actually Scott, later, you need to tell us where we can send you some printed copies. Because we will, lovingly, artisanal hand-made copies, we’ll send you some so you can see.

Scott McAllister: That would be awesome.

Jaime Woo: Yeah.

Emil Stolarsky: There’s the fun story of, we had to get these through airport security. And then there’s always the background information, of how the office smelled like a printing house because the printer was making something like, I think in the thousands of pages. Because to make 200 zines, each zine had 40 pages. And then we had a whole workshop where Jaime would fold the papers, and I would stand on them to give them that bend. There’s the, this is the amazing zine. And then there’s the background amateur production hour.

Scott McAllister: That sounds, oh I can only imagine. I’m picturing in my head right now y’all climbing on top and pushing them down. That’s great.

Emil Stolarsky: Artisanal. Artisanal production.

Scott McAllister: Yes, of course, of course. I would love a copy. So I’ll definitely give you my address after this. So while you’ve brought these post-incident reports together, I’m sure you’ve seen a lot of them. Talk about the common threads that you’ve seen throughout each of the incidents.

Jaime Woo: Oh man. I think the biggest one is that, while we’re reading them, it’s clear that different authors have different intentions. What are they kind of thinking when they’re sharing this? And the ones that are most interesting, we’ve noticed, have kind of built a story. It really does matter to have a narrative, to kind of put things together. Because then you get context that you don’t get if you just … Some authors will write in a passive voice, and everything is just about what the machines did. And what’s difficult about that is that incidents aren’t resolved just by machines. They’re resolved by the people who are working on the machines. And when you pull the people out of the post-incident report, you’re really not telling the full story. Because how people reacted to things, and what knowledge they knew, and what they chose to do, that actually matters. Understanding why things happened, and how that story came about, I think sometimes people worry that if it’s too story-like, maybe it won’t be taken seriously. So they go the opposite way. They go really dry, they go really technical, they go really boring. The problem then is it becomes unreadable. Everyone will pretend to read it and then no one will actually read it, and then off it goes. And so tell the actual story. And we’ve been seeing this, Lorin Hochstein at Netflix has been doing a lot of this work around saying, hey, include the people into the narratives. Interview people. Really get an understanding of how that works.

Scott McAllister: Yeah. I think that having a common narrative is key to getting a story across, even though all these incidents have a lot of similarities. What are some of the unique ones that have stuck out to you that have really left an impression?

Emil Stolarsky: In the latest issue of the posts, oh, should I set some background? Okay, I’ll set some background. I have never submitted something on time. And the consequence … One of the reasons for that is I love randomly browsing Wikipedia, and reading all these obscure space stories. So our favorite incident, we didn’t include the post-mortem of it, but we included it in our sort of central story, it’s the story of Apollo 12. So Apollo 12, about 30 seconds after liftoff, got struck by lightning twice. In mid-flight. 30 seconds after liftoff. And when that happened, all the controls inside the rocket went haywire and they lost data in mission control. And so you can hear the audio for this online. And all the astronauts are like, “We have every single alarm going off.” It’s just kind of like complete movie moment. And they’re trying to decide in this moment, because they have this small window of whether or not they should abort the mission. And at one point, one of the controllers in Houston was like, “Switch SCE to aux.” Which is like, switch the power to auxiliary power. And you can hear, again, on this radio loop where the CAPCOM, so the capsule communications controller, goes, what the hell is that? And then they’re like, okay, send it up to the astronauts. And then the astronauts are like, what? And one of the astronauts happened to, by off chance, know where the switch is, switched it, everything comes back online, Apollo 12 keeps going to the moon. All in the span of six, seven minutes. And you can hear, towards the end, as they’re sort of getting out of the earth’s gravity and making it proper into orbit, you can hear all the astronauts start nervously laughing. Just because they’re basically in shock. They had no idea. They’re like, oh, we need to start doing testing more. And why both Jaime and I love the story is because the controller in NASA who ended up the, I think it was the EECOM controller, I can search up his name…

Jaime Woo: John Aaron.

Emil Stolarsky: John Aaron. John Aaron knew to make that call because it was, about a year earlier in the simulator, he was just playing around, essentially. And out of curiosity, he got into this really weird simulator state, and he tried to get out of it rather than resetting the simulator. And that curiosity, which is, he was like, “Oh, if I switch SCE to aux, it reset it back into a normal state.” That curiosity, hadn’t … Go explore that a year plus ago, and then know that obscure portion of his system. And then, as a consequence of him just knowing off the bat how to fix that, he earned the nickname of Steely-eyed Missile Man, henceforth. Which is just such, you just find out … You clearly, all these people were kind of holding on to the seat of their pants, trying to get into space and putting themselves in large explosives. But then also, it’s just such a crazy feat of engineering.

Scott McAllister: That’s amazing. And it reminds you that curiosity is the common characteristic among all great engineers, right? That’s why we build what we build, because we’re like, huh, how does that work? And so that’s a great, great reminder. Bridging off of that idea, what are some of the common keys to success that you’ve seen when resolving incidents? I mean, obviously curiosity. What are some other ones?

Jaime Woo: I think that, I mean, it is the central one, right? It’s the ability, it’s the willingness to learn. It’s having that ability to remove your ego and have that curious mind towards things, which can be really difficult. Because we have time pressures, and obviously there are a lot of other pressures involved in resolving incidents. It can almost seem like there’s no time to be curious. There’s no time to just be frivolous. But it’s not frivolous, right? We are working with complex systems. You can’t inherently know everything about a complex system. You have to just dig in and play with it. And you never know what might be helpful. I mean, that’s at least what Emil says when he’s on Wikipedia all the time. He’s like, “We’re working on a complex system. So, I’m just trying to figure it out, just in case later, we need a story for something. This is why I’m reading Wikipedia for seven hours a day.”

Emil Stolarsky: Look, we don’t have enough time to talk about the one time astronauts decided to go on strike. But that is a damn good story, okay?

Jaime Woo: You can read that story. We wrote about it in our other newsletter, The Morning Mind-Meld, where we kind of just more riff on stuff that we see. But that was so fascinating. I think a huge thing, actually, is to look beyond just the thing that you have in front of you. I think why we care about space stuff is because, especially in SRE, you can see that there’s so many lessons to learn from the medical industry, from the nuclear industry, from all these other industries. And there’s always been this idea of, hey, if you’re really into math and science, maybe you don’t necessarily need to know the arts, or vice versa. And it’s not oppositional. You actually should know, you should draw from everything. You don’t know where that inspiration comes from. You don’t know where ideas come from. And I think that’s the most interesting thing, is that so often when we meet these people in person later, you realize, oh yeah, they’re a musician, or they do this other thing that’s really interesting. Everyone always has different talents that they draw from. It is really rare to see someone who can think that way, who only just deep dives in the very specific thing that they’re in. Because then you’re really locked into a specific way of thinking. And to do good problem solving, that’s not going to benefit you the way that you think it is. And so that’s something that we’ve kind of noticed, is that when we meet these people after, we’re just like, oh, that is so interesting that you have all these other things that you draw from into your work, even though they don’t seem technical.

Emil Stolarsky: In the first three issues, all of the post-incidents we looked at, they were technological ones. They were companies talking about when their production systems went down, et cetera. In the latest, in issue four, we actually took a post-incident review from Bose headphones talking about, they did a firmware update for their QuietComfort headphones. This is almost like an advertisement at this point. And the update had completely messed up the noise-canceling functionality.

Jaime Woo: That’s what people [perspected 00:27:30].

Emil Stolarsky: Right. And so it became such a big deal that they ended up going and doing this full investigation report and full writeup. And that was actually the one we wanted to elevate into the zine, because it doesn’t just have to be about software. Good engineering and good investigations are cross-industry, cross-group. Right?

Jaime Woo: Yeah. Sorry, I’m just going to, yeah. The reason I just wanted to put that in, Emil, it was just because it actually wasn’t. It was, people complained about it, and then it was … So I just want to make sure just in case, I don’t know. If someone from Bose listens, I don’t want some angry emails being like, “Actually,”

Emil Stolarsky: But I mean, let me just fight them.

Jaime Woo: I know, I know. I know. I was just like, okay. Yeah, yeah, yeah.

Emil Stolarsky: I’ll be like, “Excuse you. It’s in a podcast now.”

Scott McAllister: Yes. So tell me, tell us a little bit about some of the projects you’re working on outside of the Post-Incident Review, with Incident Labs and what you do there.

Jaime Woo: So yeah, actually we’ve been working on a product that can help teams remove bad alerts. This is something that we really think is interesting. We think that, everyone knows that there is a situation where you get these alerts and you look at them, and you go, do I bother opening this Pandora’s box, or do I just push it aside for later? Or it comes at one in the morning and you’re like, I really don’t want to deal with this right now. The problem is, these things kind of add up. And so this is something we really want to tackle, because we think that it’s kind of housecleaning that makes things a lot easier once you start getting rid of them. And so that’s been a huge thing for us, is to help teams just figure out where the bad alerts are, see a whole inventory of them, and then clear them out. Because really, once you get rid of the bad alerts, it frees up your time to do the stuff you actually want to do. All the product work you want to do, and not waste your time on this kind of stuff.

Emil Stolarsky: It frees your time up to read [inaudible 00:29:24]’s new review zines.

Jaime Woo: Or Wikipedia.

Emil Stolarsky: Or Wikipedia, yes. Or NASA reports, yes.

Scott McAllister: The important things, the important things. So where can people find this zine?

Emil Stolarsky: So people can find the zine at zines.incidentlabs.oi. We’ll add that into the show notes. And then for the project with tackling bad alerts, you can check that out at abi.io. Send us a message there, and we’ll have to show you what we’re working on, and hear about your bad alerts.

Scott McAllister: Nice. So we have another tradition on this show where we like to ask a couple of recurring questions. So what’s the one thing you wish you’d known sooner when it comes to running software in production?

Emil Stolarsky: For this one, one thing I’d like to have known sooner is that it’s about the people. Whenever you sort of look across different organizations, very rarely is it the technology that will tell you whether or not you will have a reliable system. It’s about the culture you’re able to foster, that will be the strongest indicator. It will be the one that tells you whether or not your systems are reliable. It’ll be, do you build systems that care about the operators and about the people, or do you not care if you’re creating a lot of pages late at night? And having and adopting that mindset has redefined how I look at the technologies that I’m dealing with at the end of the day. Because it resets the frame and perspective I have to look at it day to day.

Scott McAllister: Nice. All right. So is there anything about running software in production that you are glad we did not ask you about?

Jaime Woo: Yeah. So for this one, I mean, this is really interesting. Because we specialize in SRE. And the question that we get so often is, should I do what Google does for SRE? And it’s a really interesting question, because obviously Google pioneered SRE, they wrote the books on it. And there’s a lot of knowledge there. But if you are a smaller company trying to do SRE, to run your software in production exactly the same way Google does it, I think the closest example I have is, I love Beyonce. I think Beyonce is phenomenal. I love Beyonce, I listen to Beyonce, I try to do what Beyonce does. But can I actually achieve the same things that Beyonce achieves? Now, this is not a slight against me in any way. But I am not Beyonce. The way companies are like, we’re going to do SRE just like Google, you’re basically being like, I’m going to be Beyonce. And I have an unfortunate reality for everyone, you’re not Beyonce. So try to listen to Beyonce, try to hear what Beyonce is saying and maybe apply some of that. But you cannot do everything that Beyonce does exactly the same way.

Scott McAllister: That is so true. On so many levels. Well, Emil, Jaime, this has been an absolute pleasure. Thank you so much for joining us today.

Emil Stolarsky: Thank you for having us.

Jaime Woo: Thanks so much, Scott.

Scott McAllister: This is Scott McAllister, and I’m wishing you an uneventful day. That does it for another installment of Page It To The Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes on pageittothelimit.com, and you can reach us on Twitter at Pageit2thelimit. Using the number two. That’s @pageittothelimit. Let us know what you think of the show. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Show Notes

“Incidents aren’t solved just by machines. They’re solved by the people working on the machines.” - Jaime Woo.

Starting the Zine

Emil and Jaime met while working at Shopify and bonded over rock climbing. And while reading a climbing accident book they were inspired by the lessons shared and instantly made the connection to how those accidents related to software incidents.

“When we think of post-incidents in companies…you always think of this very large document,” Emil says. “You think of who was involved, timelines, descriptions, stakeholders, etc. With the book, it ran the whole range. If it was a report from park rangers, they could go into very detailed minute-by-minute exactly what happened. And then, if it was a self-submitted report it could be something like one paragraph, two people were climbing, a rock fell loose, and they fell to the ground. That was the whole description.”

In addition to these accounts the book would also have best practices and considerations to think about, or first aid techniques to keep in mind for similar situations.

Emil talks about how these reports were a catalyst for their idea to catalog and share the post-incident reports that were interesting to them. Many have to do with running software in production, but some don’t.

“You might think that is a bit grim,” Jaime says, “because you’re reading this book about how this thing happened to this person. And this feels weird to be so excited about all these bad things that happen. But, it’s important to have that learning. Someone already had to learn this lesson. If you don’t share that across the community, then there’s a real possibility that someone else will fall into this same trap.”

Myths about Post-Incident Reviews

Jaime talks about the myth that if you write a post-incident review, people will read it because it’s good for them. He says, “I’m right and there’s a lot of good lessons there, so when I put it out there, obviously people will take the time to read it…That’s just not the way it works. We’re all so busy. We’re all so overwhelmed. You can’t just write the post mortem. You have to think about how the post mortem is going to be used.”

“That’s why, when we printed the zine, we were saying, ‘Hey, we put in a lot of effort, now just carry this with you! When you have time it’s right here for you! We made it easy for you to read and absorb and enjoy, so why not?!’”

The key point, Emil says, is that you need to have a culture of reading the reports. But you also need to be writing reports that are fun to read.

Goals for the Zine

This really started as a passion project. Jaime talks about how they just wanted people to talk about these topics and more companies to release these post-incident reports.

“Unfortunately we don’t live in a culture where admitting mistakes, or errors, actually makes us look stronger…But, you can’t learn if you’re always pretending to be perfect.”

After they printed their first issue they took 200 copies to a conference to distribute.

“The reception was so lovely! We started having people come find us and were like, ‘I heard you have a zine and I really, really want a copy’,” Jaime says.

Emil describes the reaction of those who saw their incidents in the zine. “The sheer joy of seeing some of the people at the conference–we put their incidents in the first issue–seeing their sheer joy of seeing their blood, sweat and tears on paper, that was the best feeling ever!”

Common Threads in Incidents

Jaime: “The biggest one, while we’re reading them, it’s clear that different authors have different intentions. The ones that are most interesting are the ones who have built a story.”

“Incidents aren’t solved just by machines. They’re solved by the people working on the machines.”

“Include the people in the narratives. Interview them. Really understand how things worked.”

Common Keys to Success in Resolving Incidents

Emil talks about their favorite incident, which is the story of Apollo 12. On its way to the moon, the rocket was struck by lightning 30 seconds after lift-off.

“When that happened all the controls inside the rocket went haywire…they had every single alarm going off. They’re trying to decide in this moment if they need to abort the mission. And, at one point, one of the controllers in Houston said to switch SCE to Aux…and, by chance, one of the astronauts knew which switch he was talking about. And, when he switched it everything went back online.

“Why both Jaime and I love this story is because the controller at NASA…”

“John Aaron,” says Jaime.

“John Aaron! John Aaron knew to make that call because a year earlier in the simulator he was playing around and out of curiosity he got the simulator in a really weird state. And, he tried to get out of it, rather than resetting the simulator. And that curiosity led him to learn that if he turned SCE to Aux it would return the system to a normal state. As a consequence of knowing off the bat how to fix that he earned the name, ‘steely-eyed missile man’.”

Other projects

Jaime explains, “we’ve been working on a product (Ovvy) that will help teams remove bad alerts…because then that frees up your time to work on the things you actually want to do.”

“It frees your time up to read Post-Incident Review zines!!” Emil adds.

Recurring Questions

Emil says, “[the] one thing I would have liked to have learned sooner [about running software in production] is that it’s about the people. Very rarely is it the technology that will tell you whether or not you’ll have a reliable system.”

Jaime is glad that we didn’t ask, “Should we do what Google does for SRE?” To which he says, do I have to be Beyonce to be successful?

Additional Resources

Guests

Jaime Woo

Jaime began his career as a molecular biologist before following his passion for communications, working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function. He co-founded Incident Labs, helping teams better manage their incident response data to return hours for planned work. He is also an avid lover of dumplings.

Emil Stolarsky

Emil is a site reliability engineer, who previously worked on caching, performance, & disaster recovery at Shopify and the internal Kubernetes platform at DigitalOcean. He has spoken at Strange Loop, Velocity, & RailsConf, and is the program co-chair for SREcon EMEA 2019 and SREcon Americas West 2020. He has guested on the podcasts InfoQ and Software Engineering Daily, and contributed a chapter to the O’Reilly book “Seeking SRE.”

Hosts

Scott McAllister

Scott McAllister is a Developer Advocate for PagerDuty. He has been building web applications in several industries for over a decade. Now he’s helping others learn about a wide range of software-related technologies. When he’s not coding, writing or speaking he enjoys long walks with his wife, skipping rocks with his kids, and is happy whenever Real Salt Lake, Seattle Sounders FC, Manchester City, St. Louis Cardinals, Seattle Mariners, Chicago Bulls, Seattle Storm, Seattle Seahawks, OL Reign FC, St. Louis Blues, Seattle Kraken, Barcelona, Fiorentina, Juventus, Borussia Dortmund or Mainz 05 can manage a win.

