“Incidents aren’t solved just by machines. They’re solved by the people working on the machines.” - Jaime Woo.
Emil and Jaime met while working at Shopify and bonded over rock climbing. While reading a climbing accident book, they were inspired by the lessons it shared and immediately saw the connection between those accidents and software incidents.
“When we think of post-incidents in companies…you always think of this very large document,” Emil says. “You think of who was involved, timelines, descriptions, stakeholders, etc. With the book, it ran the whole range. If it was a report from park rangers, they could go into very detailed minute-by-minute exactly what happened. And then, if it was a self-submitted report it could be something like one paragraph, two people were climbing, a rock fell loose, and they fell to the ground. That was the whole description.”
In addition to these accounts, the book also included best practices and considerations to think about, or first aid techniques to keep in mind for similar situations.
Emil talks about how these reports were the catalyst for their idea to catalog and share the post-incident reports that were interesting to them. Many have to do with running software in production, but some don’t.
“You might think that is a bit grim,” Jaime says, “because you’re reading this book about how this thing happened to this person. And this feels weird to be so excited about all these bad things that happen. But, it’s important to have that learning. Someone already had to learn this lesson. If you don’t share that across the community, then there’s a real possibility that someone else will fall into this same trap.”
Jaime talks about the myth that if you write a post-incident review, people will read it because it’s good for them. He says, “I’m right and there’s a lot of good lessons there, so when I put it out there, obviously people will take the time to read it…That’s just not the way it works. We’re all so busy. We’re all so overwhelmed. You can’t just write the post mortem. You have to think about how the post mortem is going to be used.”
“That’s why, when we printed the zine, we were saying, ‘hey, we put in a lot of effort, now just carry this with you! When you have time it’s right here for you! We made it easy for you to read and absorb and enjoy, so why not?!’”
The key point, Emil says, is that you need a culture of reading the reports. But you also need to write reports that are fun to read.
This started as a passion project. Jaime talks about how they just wanted people to talk about these topics, and more companies to release their post-incident reports.
“Unfortunately we don’t live in a culture where admitting mistakes, or errors, actually makes us look stronger…But, you can’t learn if you’re always pretending to be perfect.”
After they printed their first issue they took 200 copies to a conference to distribute.
“The reception was so lovely! We started having people come find us and were like, ‘I heard you have a zine and I really, really want a copy’,” Jaime says.
Emil describes the reaction of those who saw their incidents in the zine. “The sheer joy of seeing some of the people at the conference–we put their incidents in the first issue–seeing their sheer joy of seeing their blood, sweat and tears on paper, that was the best feeling ever!”
Jaime: “The biggest one, while we’re reading them, it’s clear that different authors have different intentions. The ones that are most interesting are the ones who have built a story.”
“Incidents aren’t solved just by machines. They’re solved by the people working on the machines.”
“Include the people in the narratives. Interview them. Really understand how things worked.”
Emil talks about their favorite incident, the story of Apollo 12. On its way to the moon, the rocket was struck by lightning 30 seconds after lift-off.
“When that happened all the controls inside the rocket went haywire…they had every single alarm going off. They’re trying to decide in this moment if they need to abort the mission. And, at one point, one of the controllers in Houston said to switch SCE to Aux…and, by chance, one of the astronauts knew which switch he was talking about. And, when he switched it everything went back online.
“Why both Jaime and I love this story is because the controller at NASA…”
“John Aaron,” says Jaime.
“John Aaron! John Aaron knew to make that call because a year earlier in the simulator he was playing around and out of curiosity he got the simulator in a really weird state. And, he tried to get out of it, rather than resetting the simulator. And that curiosity led him to learn that if he turned SCE to Aux it would return the system to a normal state. As a consequence of knowing off the bat how to fix that he earned the name, ‘steely-eyed missile man’.”
Jaime explains, “we’ve been working on a product (Ovvy) that will help teams remove bad alerts…because then that frees up your time to work on the things you actually want to do.”
“It frees your time up to read Post-Incident Review zines!!” Emil adds.
Emil says, “[the] one thing I would have liked to have learned sooner [about running software in production] is that it’s about the people. Very rarely is it the technology that will tell you whether or not you’ll have a reliable system.”
Jaime is glad that we didn’t ask, “Should we do what Google does for SRE?” To which he says, “Do I have to be Beyoncé to be successful?”
Jaime began his career as a molecular biologist before following his passion for communications, working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function. He co-founded Incident Labs, helping teams better manage their incident response data to return hours for planned work. He is also an avid lover of dumplings.
Emil is a site reliability engineer, who previously worked on caching, performance, & disaster recovery at Shopify and the internal Kubernetes platform at DigitalOcean. He has spoken at Strange Loop, Velocity, & RailsConf, and is the program co-chair for SREcon EMEA 2019 and SREcon Americas West 2020. He has guested on the podcasts InfoQ and Software Engineering Daily, and contributed a chapter to the O’Reilly book “Seeking SRE.”
Scott McAllister is a Developer Advocate for PagerDuty. He has been building web applications in several industries for over a decade. Now he’s helping others learn about a wide range of web technologies. When he’s not coding, writing or speaking he enjoys long walks with his wife, skipping rocks with his kids, and is happy whenever Real Salt Lake can manage a win.