The VOID With Courtney Nash

Posted on Tuesday, Jan 4, 2022
The Verica Open Incident Database is a community-contributed collection of software-related incident reports. Courtney Nash tells us all about it.

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @lnxchk on Twitter. All right. Welcome, folks. Thanks for joining us this week. With me today is Courtney Nash. Courtney, tell the world who you are and what you do right now.

Courtney Nash: Hi, Mandi. This is so exciting, partly because I feel like Mandi and I have known each other for so long now. That’s how it started. I met Mandi when I was chairing the Velocity Conference for O’Reilly Media and I worked there for quite a while. Then I floated around startup land and I have ended up at a startup called Verica, which plays around in the, we call it continuous verification but if people don’t know what that is, chaos engineering space which is tooling designed to help people to safely mess around in their systems and find out where the cliffs are. As a part of that, I have come onboard as a researcher. I have a weird background in neuroscience and cognitive systems and then I fell down this rabbit hole with John Allspaw, as one does, and got really into the whole space of resilience engineering and human factors and safety science and all this stuff.

I’m trying to bring some of that academic rigor, but also just a general sense of empathy towards people who have to handle software when it falls over, as it does. I’m doing some research that’s typical product research type of things and when I first started doing that, I was really spelunking Kubernetes and Kafka related things. I was building up this small little database of public incident reports, because you can’t always get inside the machine.

Mandi Walls: Yes.

Courtney Nash: Right. You are inside the machine, but maybe we’ll have to talk about that later. I had this point where I said, this is really useful, and honestly we’ve all wanted something to exist for a long time that collects all of these public incident reports into one place.

I’m not the first person to have this idea, by the way. I do not want to lay claim to that. Many others have laid the groundwork for that. I’ll come back to that, because I want to give the credit where it’s due, but what we ended up building was this thing called the VOID, which is the Verica Open Incident Database. Yes. The goal of the VOID is to provide information that helps people ask better questions about software incidents, and to potentially study these things, and what people say about them, across a more consistent corpus, because if you’ve ever tried to go find these things, you may know they’re hard to find. Some people are very transparent about these things and that’s wonderful and we want to see more of that. We wanted to just shine a light on all of that and that will ultimately, hopefully help us make the software we build and the internet that it sits on maybe safer and a little more resilient. So the goal of the VOID is to just be a place where people can drop these incident reports and then we can spend time studying that, other people could spend time studying that, we could study things together, that would be really fun. The point is other industries do this, especially safety-critical industries.

Mandi Walls: Yeah.

Courtney Nash: So the airline industry is my favorite example. Planes were falling out of the sky for a while there, in the 80s and 90s, if you’re old enough like us to remember that. That industry said, well, okay, I know that we’re competing with you and we’re competing with each other, but the, really this is not good and we need to get together and share information about our incidents and our accidents and along with a few other things, not just that. I mean, it’s a heavily regulated industry, but their safety record got way better. So I think if we can share this information much more transparently, be really open about it then, a rising tide lifts all boats. Everybody’s game can get better.

Mandi Walls: Absolutely.

Courtney Nash: So that’s the general idea behind the VOID.

Mandi Walls: Yeah. The airline one’s always interesting, right, because it’s very apparent when there’s a large disaster or catastrophe or some, it’s in the news, right.

Courtney Nash: Yeah.

Mandi Walls: It’s a big deal when something happens there and, like you say, it rolls downhill. You don’t then know how many folks aren’t going to want to take a flight based on that perceived increase in accidents. Looking at the internet as part of, it’s like an extension of your brain most of the time, it’s where you live a large part of your life and having that be trustworthy and safe and working when you need it to becomes increasingly important, the more things that we do on it.

Courtney Nash: Yeah. And that’s the point. The internet and software are integral to our lives. It used to be just cats and I mean, honestly the internet was, okay, it wasn’t really built for pictures of cats, but it’s really perfect for that. But the recent Facebook, WhatsApp outage, I mean, WhatsApp runs businesses in Brazil. It’s a huge part of India’s economy, of all of these other places outside of the U.S. and our little bubble. There’s all these downstream knock-on things that can happen that are potentially life threatening and at least in some cases economically damaging. So the transparency, just to me, seems really important and that’s why, I think, a very long road for our industry.

Mandi Walls: Yeah.

Courtney Nash: I think we live in pockets where sharing this information just seems really natural to us, but that is not the norm, I would say.

Mandi Walls: Yeah. And yes, for those of us of a certain age who’ve been watching this evolution, why now? What has happened, do you think, in the past, even the last few years that has brought this back out to the forefront? So it’s not like, I feel like some of us have been publishing these things into the ether for years, but now it feels like maybe there’s more of a backing for this kind of sharing.

Courtney Nash: Yeah. Well, I mean, I think a lot of us know this story. I mean, we’re, you have failures because you’re successful, is really, I think, the biggest thing and we’re… Companies in the software space are increasingly fast and more successful because the way we can build things has changed so dramatically in the last, I hate to put brackets around it, but because there’s these big phase shifts, but I mean, hi, there’s this thing called the Cloud and now there’s SaaS tools and there’s, you can whip stuff together pretty quick. The way that some of those things are built, you don’t even have to necessarily know exactly what’s happening under the hood. I am not throwing shade at anyone for this, but it is the very nature of the complexity of the systems we’re building now and it’s not just like internal complexity, right.

I’m sure PagerDuty knows a lot about the ways in which people’s upstream or downstream providers impact their own services, which then impact their own downstream and upstream folks, right. So it’s so complex that no one SRE, no one development team, no one organization can actually get their arms around how the things they’ve built really work and therefore also how they stop working. The pace and complexity of that has just gone berserker in the last, I mean, I guess I would say five to eight years.

So that’s where I get the transparency piece. Because if you just have your own problems, right, if you’re a Microsoft shipping software onto a disc, whatever, 20 years ago, you are going to figure your own stuff out and you’re not maybe going to learn a lot from other people, right. There’s still a huge amount of local context to how any business is run and what their pressures are and what their business model is. But there’s a point where I think a lot of people feel like some of us are solving the same problems over and over again, or having similar problems over and over again and if we’re not sharing that information, in light of all of this shared interdependent complexity, I mean, we’re just going to all be banging our heads against those walls over and over again.

Mandi Walls: Over and over again, solving the same problems repeatedly and some places I feel like they haven’t found the balance between what’s actually the important stuff that’s running your business, what are you actually known for and what’s just good industry practice and knowing what parts are shared by default almost across everybody, rather than the things that you need to hold back as part of your top secret sauce that you’re [crosstalk 00:09:05].

Courtney Nash: Yeah. I think the other thing is, in a weird way, despite the fact that our software engineering practices have changed so dramatically over the last five to eight years, our mentality is still decades old, at least, about how humans and software interact and the role that people play in these systems. That’s the other piece of it that I hope the VOID really opens people’s eyes to, is I always refer to these as socio-technical systems that fail and the really juicy, interesting incident reports that I feel like we can learn the most from really spelunk into not just like, oh, our Kubernetes cluster fell over, but what were the time pressures, the economic pressures, the political pressures, the resourcing pressures?

There’s so many factors that are not just which flavor are you on and did you upgrade or did you patch or not. Oftentimes when you peel these back or you read these really thoughtful incident reports, people say things like, oh, well, back when we started building this system three years ago, we only had two people on the team, blah, blah. And then you’re like, okay, record scratch.

Mandi Walls: Yep.

Courtney Nash: There’s so much interesting information there. This notion that software is this perfect thing that we architect and then we build it and we test it, I think it’s weird to me that that’s, we have, we are managing to hold those two things in our heads at the same time. We are making these incredibly complex systems and yet we still want to act like we can have all seeing knowledge and control over them and one of those has to give and I’d argue, it’s the latter.

Mandi Walls: I always feel like you have to have a bit of a Zen moment. You can’t step into the same river twice. You are never going to step into the same application exactly the same way twice. It’s going to be a different pot of users that are in there at any given time doing different things, exploring different parts of your application. Yeah, having that background, what decisions did you make? What were you actually told to build? Maybe there was a discussion about before it got deployed and all of these other components that come into actually building software for real life versus building software in la la land or in a fantasy world.

Courtney Nash: Yeah. Yeah. My favorite, one of my favorite patterns is the things you do to remediate an incident you have often lead to the next one. I don’t have the data on this yet. I hope that, collectively, some of us will be able to put some of these data together, but I see this, in reading these, if you read enough of these, there’s at least a hint of a pattern here. That’s, the very things we do to try to fix these things add more complexity or potentially brittleness or whatever, to the system that we’re trying to build and maintain so we’re always poking around at these things.

I think one of the things that I also wanted to highlight from the whole VOID perspective, I was hinting at this earlier, is Dr. Richard Cook, who is a researcher who has brought a lot of knowledge from medicine and work he’s done in other fields to our industry. He’s the one I steal from when I say you have failures because you’re successful. It’s amazing that these things work as often as they do, not so much that they fail, right, and the reason they work is because of people. We, you, SREs, sysadmins, whoever you are, whatever your title is, keep these things running and then you fix them typically pretty quickly actually when they break. I feel like that message is just not out there enough. So I think it matters to be able to highlight those stories where, without us, this stuff wouldn’t run, given the heavy push towards automating all the things.

That’s one of the, you all probably know this better than I do from, well, it’s in your report that you all just put out, which mirrored what I found, which is, most of the time people fix things pretty quickly, in under a couple of hours. The stuff you know how to do works. You roll it back or you figure out what changed. And sometimes it doesn’t; there’s a couple of really great incident reports. There’s one that the, which was the one, it was Honeycomb had a really great recent one. They’re one of my favorite folks for how they write up their incidents but it’s the thing where it’s like, oh no, it’s the Slack one. It’s the Slack DNSSEC one that Laura Nolan just published today. I highly recommend everyone go out and read it. I will give show notes, but it’s like, we did this thing, oh, shit. We rolled it back. Okay. It got better. We decided to do it again. Oh, bad. Still, another thing, rolled it back. Some other stuff happened in the middle and then we tried to roll it out again and it went bad again and we tried to roll it back and then it got worse. You’re just like, what?

Mandi Walls: What?

Courtney Nash: Everybody can feel like what that pain is but, and that one was 24 hours. But most of the time, it’s minutes to an hour, a couple of hours that you all fix these things because you’re good at what you do. You really are good at your jobs. I think a lot of, one of the things I’m hoping to study with the VOID too, is I don’t just have incident reports that people write up, these lovely ones from Slack and from Honeycomb and from whoever, I also get all the media articles in there. Not all of them, but the media articles are fascinating. This is actually something I want to study, look into next year, which is the way we talk about these things influences the way we think about these systems, right.

On just a really quick brush, if you look at it, the way that people who are involved in these systems write about these incidents and the way that the media writes about these things is a little different. I don’t think that’s helping anything. Actually, I was just reading this really cool article this morning from someone else. I love when our industries collide. It’s this guy who’s a Harvard fellow in transportation or something, but he’s railing on the statistic that I didn’t realize was rampant in his industry, which is that 94% of car crashes are due to human error. Anybody who’s driven a car is like, that can’t be possible.

Because they’re these big complex systems of drivers and cars and passengers and dogs and pedestrians and bikes and really crappy streets. He goes on to make this really amazing argument and he’s saying, the media needs to stop reporting this statistic willy-nilly, they just hear it and they’re like, ooh, 94% of car accidents are operator error and then that’s out there and then everyone believes it and we have a similar problem, right. We want to blame humans for our outages and we need to stop. Stop doing it. If I had one wish, I think that might be it. Stop blaming the humans because they make it work.

Mandi Walls: I wish.

Courtney Nash: They make it work, right, and if you read these things carefully, you see that. You see the expertise, you see, oh wait, so and so said, oh, that’s weird. I saw this weird thing last week and then you listen to what so and so knows about that system and then she tells you all these other things. There are so many myths and not good practices that I’m hoping we can dispel a bit more on that front.

Mandi Walls: Yeah. Well, what’s your favorite one there? One of our recurring questions on the podcast is what is a myth about this kind of work that you’d like to set straight. So, if you’ve got a favorite one.

Courtney Nash: I have a couple favorites. I will go with, which is related to this human error one. Root cause, right.

Mandi Walls: Yes.

Courtney Nash: We look at this in the VOID. I looked at how many people claim to find a root cause, or do a root cause analysis formally. RCA, put it in the title of your thing.

Mandi Walls: Mm-hmm (affirmative).

Courtney Nash: It’s about 27% at the time that we published that.

Mandi Walls: Okay.

Courtney Nash: The numbers are changing because we’re constantly adding incidents to the VOID. We have about just shy of 2000 incidents in there now, which is the tip of the iceberg. I know. I expected that number to be bigger, partly because Google writes about this in their SRE practices, they call them RCAs. Microsoft does this, they call it an RCA, a root cause analysis. Salesforce does this. So a lot of big organizations, either actual software companies or companies that are very software driven companies, Salesforce. The reason I said I was surprised is I thought this would be something that people believe they should do and the reason I think they would believe they should do it is it sounds very appealing. Again, I have a lot of empathy for this because as humans, we want certainty and especially as engineers, whoa, good gravy.

Mandi Walls: Yes, we do.

Courtney Nash: Let us have a very concise and detailed understanding of what happened. But if we go back to what I was saying earlier, there is no spoon and once you reach that point, then you’re like, oh wow. Again, you could go back to like, okay, we didn’t upgrade X, Y, Z and then you can peel, pull all the strings and realize there’s a whole bunch of contributing factors to that thing. If all those contributing factors, latent things, weren’t present at that very moment in time for whatever was happening, then it might not have happened, which is annoyingly spooky and not what, the way that engineers want to work. But the reason I would…

Mandi Walls: No, we want math. We want the math to be right.

Courtney Nash: Also, as companies, we want to be able to say, we looked into this, we found the cause of the problem so that we can promise you that this will never happen again. I can tell you, odds are, eh, it’s true. That exact thing will probably never happen again. It’s hard because you have customers, you have users, you have people who depend on this stuff, but the belief that there is a root cause can lead to a lot of really unhealthy practices, in my opinion.

Now, am I saying that Google and Microsoft are just wrong about this? Not necessarily, but I think if we can get to a more nuanced perspective of this, some people get mad at me because they’re like, you’re just using words to mean different things. Someone said they did a root cause analysis, but really they identified a bunch of contributing factors and sure, great, maybe let’s change that language, but because it matters and because root cause so often ends up being so and so did X, Y, Z, right.

Mandi Walls: Yeah.

Courtney Nash: I mean, if you say to somebody, I’m about to go push a config change to blah de blah and they’re like, huh! They inhaled, they’re like, oh God! Config changes! Terror.

Mandi Walls: [inaudible 00:19:39].

Courtney Nash: I mean, but that…

Mandi Walls: No!

Courtney Nash: But also the internet runs on config changes. Most of the time, it works.

Mandi Walls: Yes, it has to.

Courtney Nash: I got really mad at Casey Newton, who I think is a really great journalist, but he was like ranting about, “Every week I report on config changes and yet every week software companies keep making them.” I’m like, “You do understand how the cake is made, right, Casey?”

Mandi Walls: Yeah. Yeah. Come on, man.

Courtney Nash: So I would love for people to, they want to be able to, and they want to go to their board and they want to go to their whatever and say, we found the co-, I mean, they don’t, you don’t go to your board about an incident, but you know what I mean. But when you…

Mandi Walls: Well, you hope you don’t have to go to the board about an incident, but yeah.

Courtney Nash: Well, that would be a really bad one. But when you can make that leap or when you can evolve into that mindset of, there will always be a latent set of contributing factors for any incident, then you start to have a different view of your systems, and they’re systems, right. They have system effects. What you want to be able to understand is the boundaries of those systems, right. Where are the cliffs? What do your performance envelopes look like? What happens if I push it super far in the performance envelope, or if we push it super far in the whatever envelope, in Rasmussen.

We could bring that one up, I’ll give you that one to throw in the links, but it’s some great research stuff that talks about the boundaries that you push up against and there’s economic ones and there’s workload ones and there’s, and engineers have some intuitive sense of these things, but not for the whole thing. It’s not a sound bite. I have to spend five minutes rambling on about it versus saying we did a root cause analysis, but I have, I have a deep seated hypothesis or suspicion that maybe someday we will also have data for à la Nicole Forsgren, Dr. Forsgren style genius. Someday, maybe we’ll have the data to show that organizations that take this view of their systems are better able at having that resilience, right. That they have a better adaptive capacity because of the way they view their systems.

Mandi Walls: Okay.

Courtney Nash: I don’t know. That is my theory and maybe time will tell.

Mandi Walls: Interesting.

Courtney Nash: I don’t know. So that’s my favorite myth.

Mandi Walls: Well, in the meantime, what can folks out there do to help with this? How do they participate in the VOID?

Courtney Nash: Yep. There’s a whole bunch of different ways and one of which is to just send me your incident reports if they’re not there. We’ll give, there’s a link to the VOID itself and there’s a big, well, it’s not a big button right now. The big button is download the report because that’s how we do things, but there is a link to submit an incident. If you’re like, all of our incidents aren’t in there, then you can shoot me an email and I’ll work with you to get them all in. I’m just a one-woman show so when you’re like, you don’t have an RSS feed, an API yet, I’m like, I know that. That’s the other thing people could do, is they could get on board and become a partner and help us build out more of this structure and infrastructure that we need and people to do this.

So we’re working on a new member partnership program for next year so people can get involved and maybe get early access to some other research or those kinds of things. So those, I mean, the main way is to just send us your incidents. If you’re just curious, I’m on Twitter, you can send an email. Just say hi, I like that. People say, “Hi how can I get involved, what can I do?” We’re looking for data partners, I don’t know, or people that we can do research with, I don’t know yet. I’m looking at the internet from the outside in and you can only see so much when you do that.

Mandi Walls: Yes. Right.

Courtney Nash: The folks that watch it go by see something different than we do. I think there’s a lot to be figured out there still, but yeah. Send me your incidents. Ugh, please, let’s get that thing full.

Mandi Walls: Awesome. Yeah. As I’m looking through some of the folks that are on the list right now, you’ve got, it’s everywhere. You’ve got Fastly and Datadog, but also Gov.uk and Roblox and Hulu and all these folks. Who’s lagging? There’s a lot of tech companies here. There’s a lot of tech forward companies here. Where could we use more help?

Courtney Nash: That’s a great question and I love it when people ask me questions that I don’t have the answer to and haven’t looked at yet. I got my pen and paper out right now. I’ll tell you the glaringly obvious. There’s some very obviously absent characters here and I’m looking at you, AWS. Amazon and Apple. I’m sorry, I’m just going to call you out. AWS is going to be like, but we published some and I’m like, yeah, and I only just finally figured out how to find them and y’all are sitting on so many more that you don’t share and you are at such a scale that it could be wildly beneficial to the industry as a whole. So I’ll just say that. There are some glaring open spots there that they could gladly come in and fill. As far as, yes, it’s across a bunch of industries and you see stuff in there that’s very traditional businesses that have become more software driven, right, and governments and those things. Who’s lagging? I don’t know. I now have to go look. I will get back to you on that because that’s a really good question [crosstalk 00:24:55].

Mandi Walls: Excellent. [crosstalk 00:24:57] Yeah. I mean, I like to know. I mean, that’s one of the questions I have for some of our data as well. Who’s out in the front, who’s starting to pick up speed and looking at the things across industry to see where we could be focusing on helping people more and getting their stuff back out there.

Courtney Nash: Yes. If people want shining examples of organizations that I think are doing it right, or, I don’t want to say right, because that means it sounds like everyone else is doing it wrong and that is not my goal. But I think folks are doing a really unique and interesting job of this. I already mentioned Honeycomb, but they just published another multi incident analysis of some wackiness that happened to them in September and October on the heels of another one they did from earlier this year. I had them on The VOID Podcast. Ooh, that’s another way people can get involved, but we’ll come back to that. And then Slack does a really good job. Like I said, Laura Nolan is one of the folks, one of the SRE engineers who’s involved in their incident analysis and writeups. She just did the recent DNS one. She also did the January 4th one. Remember we all came back to work this year and it was like, hey!

Mandi Walls: Oh, yeah. [inaudible 00:26:04]. Yeah.

Courtney Nash: Slack’s down. She wrote an amazing one from that and the folks at Reddit wrote up some really, really thoughtful, also, an anthology of incidents that happened when, remember the whole GameStop thing?

Mandi Walls: Oh, yeah.

Courtney Nash: So there was a whole subreddit that got just nuked, which nuked Reddit. It was that big of a deal. They wrote up a bunch of really thoughtful, in depth analyses that went into all the socio-technical factors, those kinds of things. I think those, if you look and see those, those are really great examples of people who are putting a lot of thought and time and energy into these. But if you ask them, they’ll tell you, they think it’s totally worth it. The investment’s 100% worth it in terms of what they learn about their systems. That’s what’s cool. The Reddit one, they talk about near misses and things they did based on things they’ve learned in the past that actually worked. It was like, oh, this is why we invest more time and energy in understanding how these things work. So that was really cool. They’re also on the podcast. So I’ll do that. Here’s my plug.

The other way you can get involved is if you have written up an incident publicly, you’ve published it and you want to come on the podcast and talk about it. I know this is hard. Lawyers, PR, comms team. I’m patient. I’m stubborn. I’ll keep asking until you explicitly say no. If I can get Reddit to come on and talk, I can… Come on, people. Come and talk to me about this. It’s possible to do this and not reveal state secrets, but listening to the people who’ve been involved talk about it has been just absolutely amazing. Just for me. I’m in a, mostly, I just sit there and don’t have to ask any questions because they just have all these amazing things to say. Come on the podcast. Come tell me about your incidents and that’s the other way you can get involved.

Mandi Walls: Awesome. Sweet. All right. Well, we’re just about out of time. Is there any other parting thoughts or anything you’d like to, any advice to give to folks that aren’t sure what all this is about and what to get started with? Any good first incident to read that’s on the list?

Courtney Nash: Yeah. I mean, honestly, go to the VOID, type in Honeycomb. You can look, there’s a filter on there that says ‘Has Expert Commentary’. Go look at those. There’s a handful of incident reports on the VOID that have expert commentary on them. If they’ve been on the podcast, they’re flagged in there. We’re always adding to that. I will be talking more about these things, so I have a newsletter if you want to sign up for that. But I think the parting shot I want to have that is not a promotional shout out is just to, again, remember that you, the people who build and maintain these systems are very good at what you do. The vast majority of the time, all of these incredibly important systems stay up and running because of you. We applaud you and we thank you. When you are fighting fires, try not to forget that.

Mandi Walls: It’s definitely a lot of appreciation for all those folks out there. I no longer have to run systems myself and I am very glad of that, but have the scars from those days. I’m very happy for the way the industry has matured and is progressing and shout out to all those folks out there that are keeping it all going.

Courtney Nash: Yep.

Mandi Walls: Courtney, thank you. Thank you so much for being on and sharing everything about the VOID with us. We will have plenty of things in the show notes for folks who are interested in learning more and reading all those things and hooking up with the VOID and doing all that great stuff.

Courtney Nash: Yes. Thanks Mandi. This was fun.

Mandi Walls: All right. So thanks everybody for joining us this week. I am Mandi Walls, we’re signing off and we are wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast, if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit using the number two. Thank you so much for joining us and remember: uneventful days are beautiful days.

Guests

Courtney Nash (she/her)

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.

Hosts

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.