Rerun: On-Call Nightmares With Jay Gordon

Posted on Tuesday, Oct 19, 2021

Jay Gordon is the host of the popular On-Call Nightmares podcast. Matt and Jay discuss some of the stories Jay has heard, as well as how on-call has changed over the years.

Transcript

Matt Stratton: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry, to improve both system reliability and the lives of the people supporting those systems. I’m your host, Matt Stratton, @mattstratton on Twitter. I’m joined today by Jay Gordon from Microsoft, who’s also the host of the popular On-Call Nightmares podcast and that’s exactly what we’re going to talk about today, on-call nightmares. So thanks for joining me, Jay.

Jay Gordon: No problem, Matt. It’s a little bit before the end of the, what I would call, working year for a lot of us. It’s December 23 while we’re recording this and the cool thing about it is I got some time to chit-chat with you.

Matt Stratton: Yeah. So I think to get started, maybe you could tell our listeners a little bit about your journey. Your show, On-Call Nightmares, just turned a year old. You’ve been doing this throughout 2019. Just at a high level, tell us about some of the stuff: what’s the purpose of your show and what’s the topic? And then we’ll pivot that into talking about some actual on-call nightmares.

Jay Gordon: Sure. Around this time last year, I got to this point where I really thought that I was hearing a lot of people tell me war stories when I would go to places. I go to DevOpsDays, go to other conferences, hang around at the bar after. War stories became something that I just got so used to hearing and telling, and so it got to the point where I started thinking, all these conversations we have at the bar, why is no one recording them? And so, I decided I was going to start recording them and I did so. I guess it’s been now 45 or 46 times. I’ve done a lot of them, I’ve spoken to people that have really helped change my thoughts about on-call, and I’ve spoken to people that have also maybe reinforced some things that I thought were right about on-call.

Matt Stratton: What do you think are some of the popular myths or misconceptions about on-call that you see?

Jay Gordon: Well, I think from the highest level, that on-call is just an extra part of your SRE’s or system administrator’s work, that it isn’t a real big part of their duties. Sometimes in years past, at least for me, it was looked at as extra work. You’re just part of the on-call, just a part of it, and it wasn’t always taken very seriously, especially the impact of on-call on the individual and what you really can lose from being in a really nasty incident. You can lose time, you can lose self-confidence, because maybe something took too long or didn’t go well, and you can lose a lot of patience and even confidence from the company when these bad things happen.

Matt Stratton: So, one of the things that we talk about a lot is that on-call is not just for SRE, it’s not just for ops, although you and I both have a background in ops and probably most of our careers would have been carrying a pager, so to speak, and spent in that ops role. Have you spoken to folks on your show that are not what you would consider the traditional on-call type roles?

Jay Gordon: Yeah, and the first one that always comes to mind, and I think this particular conversation was influential on me using this term a lot, is Andrew Clay Shafer using the term, the conscientious developer. It’s what he was before the word DevOps was ever used by him once. He told me that he considered himself a conscientious developer, and that when on-call incidents occurred, he would be part of that. And because of that, it grew him into a more capable software developer and grew into what eventually became DevOps. The idea that you can break down that wall and have people who write software understand the operational rules.

Matt Stratton: I think one of the things, and you and I talked about this before, is that I spend a lot of time talking to organizations where they want to move to this idea more around full service ownership, which is more than just putting devs on-call, but putting devs on-call is part of that. And a lot of folks who are in a role of a software engineer are very resistant to going on-call, which I can understand and I can appreciate. I think a lot of what I’ve seen is that this resistance comes from most software engineers having a perception of what being on-call means and what that experience is like, and it’s because of the horror stories that they’ve heard from their friends working in ops. And I think when we think about what modern on-call could be like, it’s very different. What are some of the things you think have changed in what it means to be on-call today, here at the tail end of 2019, versus the last 15 or 20 years when you and I have been doing it as well?

Jay Gordon: Automation, without a doubt. Automation has made so much of the difference, because there are ways to run through certain automated workflows and follow through those, and a lot of that automation has been documented well, so that you can go through documentation to pinpoint what exactly is the problem, as opposed to having to search through a haystack and find that one needle that’s the problematic issue. So, when I add automation, I add in all the tooling around automated deployments and management. So, that’s got to also talk about the observability tools, the things that are watching what’s going on through the automation process, watching through the deployment process and keeping an eye on everything. So, I think, you’re automating everything, you’re installing agents that do greater amounts of monitoring, and you’re having greater ability to spin up systems and replace things much faster. So I think automation made just a huge difference.

Matt Stratton: I think another piece is the way that culturally it’s changed, and it’s also why full service ownership really comes into play. It’s not this model of, you have a team that’s on-call for everything that might be happening in your business; it’s being a little more directed towards the domain experts for that particular service, and so I think that’s the difference. If I’m a software engineer that has created this one service and I’m on-call for that, that’s a very different experience than… there’s this admin, who was on-call for anything that went wrong with the entire application, if it happened to happen overnight. So I think understanding that difference matters, and also the more folks that go on-call, the less time you spend on-call as well. That’s a big part too: as we distribute this, I think we see a big difference in that.

Jay Gordon: Well, distributed blast radius, wouldn’t you say? If you think about it, what you’re doing by adding these additional people into your on-call processes is you’ve distributed what your blast radius potentially could be. If it’s a distributed type of problem, if you have a number of different domain experts all on it at the same time, repairing an issue, you’re going to get a lot further along in repairing that issue than if you have a small team working on one big system dealing with all this. So, I think the way we’ve kind of spread out as people, we’ve gone to things like microservices, but we’ve also spread out the amount of people who are actually looking at all those services, and we’ve reduced their blast radius by spreading it out a bit more.

Matt Stratton: I think one other thing that I tell folks who’ve never been on-call before that are leery about going on-call is, so the beautiful thing about being part of an on-call rotation is you get to go off-call because I have news for you, if you’re not on-call, you are. People know how to get ahold of you and the difference is they get ahold of you whenever they want, as opposed to being able to be part of a formal rota, where it’s like, “No, you know what? I’m actually,” I mean, it’s very relieving to be able to know I’m not actually on-call right now, which means I don’t have to even worry about somebody calling and asking me about this. Otherwise, you have that Damocles of the pager hanging over your head, even though you’re not carrying the pager. Trust me, your ops team, your folks, they know how to find you and they will, because you’re just the person they know, as opposed to the person they should be contacting.

Jay Gordon: Yeah, and that’s why I throw automation into there a little bit more, because you’ve automated the tooling about rotating who’s going to be on-call when, and it’s less for you to think about in the process. I remember going through a manual process of having to modify config files in Nagios to monitor and then alert one particular person. So we’re going back into times where the lack of automation around a lot of the things that we did made an on-call process just that much more difficult. Everything from producing a new system, if you had to do that from scratch and by hand, it just takes sometimes hours doing a Linux install that installed all the associated but once we got to the automated installation process, less pain. And so you can almost attribute the same kind of advancement with on-call with once we got out of the thing of really like the spreadsheet driven on-call rotation, and you went into tooling that was better managing when managers know people are on-call that they have a few, that really made a huge difference I think for me and my confidence around being on-call. Because at least if I had incidents and they were clicked off in the tool we were using, and in some cases it was PagerDuty, they would see in the reports how busy we were in times where things sucked. And I think that that’s one of the big things that developers should understand has been a big problem around the idea of on-call, is that people have always said, well, on-call sucks. And it’s not really that idea that you should feel is why you may not want to be on-call. You may not want to go on-call because you just don’t really want to have it interrupt your off time and that’s really what it comes down to. So you should take the job that fits whether or not you can do that, and there are a lot of roles out there where it’s very obvious that they want you to be on-call and it’s a big part of your job, but you don’t have to take those jobs. So I think that’s one big thing that we need to recognize: it’s a choice to be on-call, so that labor needs to be treated as labor. It’s a lot of work to be on-call.

Matt Stratton: I think I will agree with the second part of what you said, which is that you compensate for that. I think that the danger of the job is that what’s happening is that, as we go through changes, they’re like, I didn’t sign up for this, and that’s the problem. The problem is not with people taking a new job. The problem with this change is when an organization is trying to actually distribute that work and you have your existing people that say, well, I chose to be a software engineer because I didn’t want to be on-call because I care about my family. By the way, I really hate that. That’s why I’m being opinionated, because apparently only software engineers may have families [crosstalk 00:11:26]

Jay Gordon: Oh yeah, and I don’t think it’s about the family thing, what I think personally and the reason why I make the difference and I completely get your point. I think that some people literally just don’t want to be bothered about it after they’re out of the office. I’ve met these people and it has nothing to do with what their personal life is or who they’re married or not married to or have children with or not. I just met people that didn’t want to do work after hours.

Matt Stratton: So I think the difference of that is, it’s not choose the role, choose the company and it needs to be a company that stops doing business after 5:00 PM. And that needs to not be because I am a developer or I am this. Again, as Jay said, there are jobs that exist and I’ll tell you an interesting story. I was at a DevOpsDays and we were having a conversation around incident response because that’s what I talk about and there was a fellow in there who ran systems for the Indianapolis public library. And he was like, you know what, we don’t have after hours issues, and I was like, you’re right, you sure don’t, because you’re a library and at the end of the day, nothing happens till the morning and that’s fine. And I think that’s another thing for folks to understand: not every organization is a 24/7 organization, but if you are, that’s part of the job.

Jay Gordon: Exactly.

Matt Stratton: If the service that your company provides needs to be available at that, then everybody has a part in that. So I think that’s the nuance maybe around, if you don’t want to have to potentially be on-call sometimes then you should get a job for a company that doesn’t actually do business after hours.

Jay Gordon: Yeah, that’s kind of a better way of putting it and essentially you need to take the roles that fit you, your lifestyle and how you want that to interact with what you’re going to go on and have part of your life when you’re not working. So for me, I’ve never really had that issue. I’ve always worked in situations where after hours was fine. My personal family situation always was worked out, never really a big issue, and that’s how I think it is for a lot of people who do this role and have families of different sizes or types. In the end we’re all just people and we all have basic requirements in order to accomplish basic things like eating, having some water, getting enough sleep and spending time sometimes with people we like and I think we should keep it in perspective. These are simple things and if I’m able to get all that done and I’ve got some time to do the on-call and feel comfortable about myself, that I’m accomplishing all these basic things, then I’m glad. On-call shouldn’t take away from any of the basics. So that’s what I’m kind of getting at here, is that you should be able to eat, you should be able to drink and you should get some sleep at some point. On-call should allow you to still do those things because one of the big advents of on-call that I’ve seen in the last few years, and this is tremendous, I think is the way you’ve been able to do escalations in a lot of systems built in or how it’s not just built into a system, but built into the rotation. I’ve been on-call for four hours on this particular incident, it has not reached a point where I’m going to do anything, let’s move up the chain and having that in a documented and understood fashion where it’s not just you get the phone call eventually, you know what I mean?

Matt Stratton: Yeah. I think you nailed it, is that it’s not about this binary I will not do on-call but it’s like, these are the requirements I have to be part of the on-call culture. How do we make this better? Which is okay, what is the realistic nature of the rotation? And so those of you who are thinking about going into jobs where on-call is part of it, whether you’ve been doing this for your whole life, or maybe it’s coming new, here’s a good question to ask, which is how are responders rotated off of an incident? Because that’s a big thing. Because here’s what sucks about on-call. This is the nightmare that everybody thinks about: I got paged at one in the morning and I worked an incident until 6:00 AM and then I had to go to work. Okay, you know what? That super sucks or even worse is, I got paged at 6:00 PM on Tuesday night and the incident ran until the next morning and I was up all night. I got news for y’all, that is not a good way to run an incident. You should be being rotated off because responders stop being effective after a couple hours. So you want to ask questions about that, not, will I be on-call? But find out what it’s about. Ask questions about, like Jay said, what’s the escalation, how does that work, what’s the size of the rotation? That’s more important than asking questions like, how often do you get paged, by the way. What you care about is not how often it happens, but what it’s like when it does. Are you expected to work a 12 hour incident without any kind of break, or things like that? Those are things to watch out for.

Jay Gordon: I like to think back to an incident that I had a few years ago, it was many years ago where it was at the point where I got paged at like 3:00 AM. I kept having the problem and it’s 07:30. I had to be out the door to get into the office by at least 08:00 and eventually I had to beg someone who was going to be on shift next. And see that’s the problem with not having solidified rotations with escalation points, that you may end up having to be that person that’s begging for someone else to take over so that you can go get three hours of sleep and shower, so you can come into the office. And I thought that this was a really awful way, but the kind of normal way for most of us, especially houses that have a ton of different customers, like in hosting back in the early aughts, late nineties. It was just, someone works on the problem, you get a phone call when the problem gets really bad. And then you basically take ownership until it’s fixed and if it’s not fixed then you beg someone who you may know who can come on and fix it best. This was no way to run a business, but a lot of companies got away with it because they were able to explain away downtime and nowadays it’s very difficult to explain away downtime.

Matt Stratton: So thinking back through a lot of the stories you’ve heard from doing so many episodes of your show and you’ve heard all these great nightmares, what are some of the maybe patterns that come to mind or things that people could do to help avoid having an on-call nightmare?

Jay Gordon: It always comes down to tech debt. It’s so amazing but a lot of the stories that I’ve heard from people come from tech debt. They talk about issues that they were looking for, for years or never thought about and it becomes a self-fulfilling prophecy of eventually by not handling some of your technical debt, whether it’s someone with a database that’s been hanging by a thread or somebody commits a command on a production database because they were configuring it to a test that they were trying to run. These are all the things that people have: by not having a specific environment to do development, sometimes people would do things and it would end up going against production. These are like all those real scary things that I’ve heard people talking about. So not having documented information, not having documented processes and having it all eventually wind up as additional tech debt. It’s amazing how much technical debt is actually lack of documentation and I find that to be terrifying, as well as that there are services that are up, they’ve been on for years, that no one’s documented, and it ends up being one of those scary parts where if it ever falls down, nobody knows what to do.

Matt Stratton: What is your personal favorite on-call nightmare? That can either be one you’ve heard or one you’ve experienced, but what’s your favorite? Whatever version of favorite that is.

Jay Gordon: I guess personally my favorite incident that I learned a lot from probably was my situation with the dress when I worked at BuzzFeed. So if you don’t know what the dress was, the dress was a situation where a woman emailed a coworker of mine and said, we can’t figure out what color this dress is, and the person, Cates Holderness, eventually shared it on BuzzFeed and Twitter. It went viral and I had to work on the web servers associated with this poll that they were taking, but it was the same production CMS that the news people were using to do like important news stories. So we had this huge problem with nobody being able to write to the database while a whole group of people just wanted to vote about what color they thought a fricking dress was. And so that to me was amazing because I learned about traffic at a scale I’d never seen before in any place I had worked prior, and the nightmare of it all was, it was in the same day we had an incident around llamas, that a bunch of llamas got like loose in Arizona. So these things produce viral moments on the internet and viral moments can then have impact on your operations plan. And your ops plan might’ve been, you go after, you fix a web server or two and everything goes back in a product, but when it’s at a particular level of traffic, that it’s something you can’t really plan for. It’s amazing how that kind of incident gives you the ability to finally use those reps that you’ve built over the years of like, I built up a certain set of skills where I’m supposed to look when these problems happen, to say there’s a certain set of skills. I haven’t seen those movies, but I thought that sounded cool there but you know what I’m getting at?

Matt Stratton: Yeah absolutely. So one last nightmare question. I don’t know if anything comes to mind, but what’s the funniest on-call story you’ve ever heard?

Jay Gordon: Tim Yoakum from InfluxDB. Tim told a story about water cascading into the data center. It was pretty amazing, he was in the data center and there’s water cascading from the ceiling onto the racks and there’s water shooting out of floppy drives is what he’s telling me. It’s one of the most wild stories I’ve ever heard when it comes to a data center incident. I mean, we’re not just talking about a few computers getting wet, we are talking about a full blown shower down on a couple of data center racks all at the same time. It was pretty wild, if you want to hear the rest of the story, check out episode 30 with Tim Yoakum of Influx. I think it’s really a great kind of… you can’t predict because on-call can almost produce a number of different issues, this happened to be just a wet and wild one.

Matt Stratton: All right, fantastic. So, Jay, there are two things that we ask every guest on this show. So the first one is, what’s one thing you wish you would’ve known sooner when it comes to running software in production?

Jay Gordon: That the people who wrote it gave me all the information that I’m going to need when it goes wrong. So that’s if I’m in an operations role, I want to know that I know how to get things taken care of at any time without having to call or escalate. As long as you’ve documented it and you think I should be able to recover it, that’s the most important thing.

Matt Stratton: And secondly, is there anything about running software in production that you’re glad did not come up in the show?

Jay Gordon: Yeah. Like invalidating cache. Invalidating cache, if you’re asking me what the difficulties are about invalidating cache across a global like CDN or something like that. I’ve done that kind of stuff and it sucks. If you’ve got a particular object that’s on a bunch of CDNs across the globe, it’s one of those things where you sometimes have to programmatically do it with some of the big vendors. It’s a real big pain. I don’t miss that stuff at all.

Matt Stratton: So pro tip to my fellow Page It To The Limit hosts: do not invite Jay on our show about cache invalidation, not that we were planning on having one, but maybe we need to now. I don’t know if there’s 30 minutes worth of stuff to talk about with that, other than saying it sucks over and over again.

Jay Gordon: It’s interesting if you think about it on a really grand scale. In the case of streaming video and now how many other systems rely on heavy caches. It’s amazing to think how much goes into supporting big caches for those big content delivery networks. So I want to hear someone talk about it one of these days, I got to find a talk about it.

Matt Stratton: Just have it not be you.

Jay Gordon: Exactly. I want to hear it, I don’t want to do it.

Matt Stratton: So great, thanks Jay. Where can our listeners find your show, find you, tell us [crosstalk 00:24:42].

Jay Gordon: Sure.

Matt Stratton: Or more.

Jay Gordon: You can find me pretty easily on Twitter. It’s On-Call Nightmare or just go to oncallnightmares.com. And if you want to find me personally, jaydestro, J-A-Y-D-E-S-T-R-O on Twitter. It’s really easy to find me, it’s an old IRC name, but my name is Jay Gordon. You can always find me at a lot of the different Microsoft and Azure events like Ignite and Ignite The Tour. I’ll be around a bunch of them all over the world this coming year, and I’m looking forward to seeing new people, especially at DevOpsDays New York City in 2020 on March 3rd and 4th.

Matt Stratton: Fantastic. Thanks again, Jay. I appreciate it. This is Matt Stratton wishing you an uneventful day. That does it for another installment of Page It To The Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number 2. That’s pageit2thelimit. Let us know what you think of the show. Thank you so much for joining us and remember, uneventful days are beautiful days.

Show Notes

“All these conversations at the bar…why is nobody recording them?” - Jay Gordon, the host of the popular On-Call Nightmares podcast, talking about where the idea for the show came from.

Popular myths or misconceptions about on-call

One of the biggest myths is that on-call is just an extra part of an SRE or sysadmin’s job, and not really a big part of their duties. It’s just a thing you do; it hasn’t always been taken seriously, especially the impact of being on-call on the individual.

Remember - on-call isn’t just for ops or SRE. Andrew Clay Shafer used to describe himself as a “conscientious developer”, even prior to the ideas of DevOps. Because he thought about things this way, it caused him to be a better developer, and this heavily contributed to the foundation of the DevOps movement.

Software engineers are often resistant to being on-call because of what they think it means - based on the horror stories they hear from their coworkers and friends who work in Ops.

How has on-call changed?

Jay: “Automation has made so much of the difference”

Well-documented automation makes it easier to track down what might be contributing to issues. Observability tools watch what is going on through the automation and deployment process. We have a greater ability to spin up replacement systems, too.

We are changing from a model of having one team who is on-call for everything inside the business; now it is more about selected domain experts on-call for the thing they know really well. Being on-call as a developer, you know you are only being called about things you know about. Additionally, the more people who go on-call, the less time each person spends on-call. So the experience is a lot different. “We’ve reduced the individual blast radius by distributing it” - Jay.

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you - you’re always on-call” - Matt. It’s very relieving to know you are not on call, so you don’t have to worry that someone will call you. “Trust me - your ops team knows how to find you, and they will” - Matt.

On-call requirements are different

Not every company or service requires 24 hour on-call support. When you are thinking about where you want to work, consider this. That said, if you do work for an organization that provides a service around the clock, on-call is likely a part of that job, and everyone should consider it part of their service ownership. But ultimately, make the decision for the role that works for you. It’s less about the title or role than it is about the type of company or organization and what they need. As Jay points out, “in the end, we are all just people, and we have basic requirements - like eating, having water, getting enough sleep, and spending time with people we like. On-call should still let you do these things”.

A good question to ask when getting into a role that has an on-call component is “how are incident responders rotated off of an incident?” Responders stop being effective after a couple of hours - understanding things like “what’s the size of the rotation?” and “what are the expectations of a responder during an incident?” is much more important than knowing “how often will I get paged?”

How to avoid having an on-call nightmare

Jay: “It always comes down to tech debt. It’s amazing how much tech debt comes down to a lack of documentation. It becomes one of those scary parts that if it falls down, nobody will know what to do”

Additional Resources

PagerDuty Home Page
Episode transcribed by Rev

Guests

Jay Gordon

Jay Gordon is a Cloud Advocate with the Microsoft Azure Advocates. He and the rest of the Advocacy team are focused on helping Developers and Ops teams get the most out of their cloud experience with Microsoft Azure. Prior to Microsoft, Jay was part of teams at DigitalOcean, BuzzFeed and MongoDB. Jay lives in New York City with his wife and has a goofy pug named Rico.

Hosts

Matt Stratton (He/Him)

Matt Stratton is a DevOps Advocate at PagerDuty, where he helps dev and ops teams advance the practice of their craft and become more operationally mature. He collaborates with PagerDuty customers and industry thought leaders in the broader DevOps community, and back in the day, his license plate actually said “DevOps”.

Matt has over 20 years of experience in IT operations, ranging from large financial institutions such as JPMorganChase to internet firms, including Apartments.com. He is a sought-after speaker internationally, presenting at Agile, DevOps, and ITSM focused events, including ChefConf, DevOpsDays, Interop, PINK, and others worldwide. Matty is the founder and co-host of the popular Arrested DevOps podcast, as well as a global organizer of the DevOpsDays set of conferences.

He lives in Chicago and has three awesome kids, whom he loves just a little bit more than he loves Doctor Who. He is currently on a mission to discover the best phở in the world.