Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. Today, I’m really excited to have on our show, Mandi Walls, our new devops advocate from PagerDuty. Mandi, do you want to go ahead and introduce?
Mandi Walls: Yeah, thanks, Julie. My name is Mandi Walls. If you’d like to find me on any social media, I am lnxchk. Today we’re going to talk about some of the things we should be thinking about with COVID. What it means for new normal for folks working in running software and what it means for people in operations. We are joined today by J. Paul Reed. He’s a senior applied resilience engineer at Netflix, and get us started.
J. Paul Reed: Yeah, thanks. Happy to be here in our COVID lockdown home offices.
Julie Gunderson: Well, J. Paul, I have to tell you that Netflix has really personally saved me during this lockdown, but before we get started talking about really what COVID means for the new normal, I would like to also address other very important topics with you.
J. Paul Reed: What have you got? Let’s go.
Julie Gunderson: One of them… Yeah, it’s season two of Messiah, apparently Netflix did not renew this, and so obviously this is something that you can absolutely directly take care of for me. Is that correct, J. Paul?
J. Paul Reed: That is correct. What I will do is I’m going to go file an incident ticket right now, and the summary will be “Julie needs Messiah season 2, stat,” and we’ll get that going for you. No, it’s funny too, because I have been at Netflix for… I think it would be 11 months and all of my friends are like, “Hey, can you get another season of that show?” That’s a whole other division that I’m not in, but we can still work on it and see what we can do. By the way, I liked Messiah too. I liked that the last episode had a cliffhanger-esque feel to it, didn’t it?
Julie Gunderson: It really did, and I feel like there are unanswered questions, and it’s interesting, I think I’m not alone in the world in the fact that streaming services at least brought more to our lives when we’ve been stuck at home. Right? I’m kind of curious, like that’s a pretty big burden on Netflix to make sure that there’s all the bandwidth, even just to think about everybody. Do you want to talk a little bit about that?
J. Paul Reed: Yeah. Yeah. When we kind of had the initial lockdown in March, I mean, we certainly felt, I think as a CORE team… A little bit, I didn’t get through this, so I’m on the CORE team and CORE stands for Critical Operations and Reliability Engineering. What we do, kind of the simplest way to explain it is, we are on call for Netflix. If Netflix is down, we are pretty much the first in the company, down for our customers, we’re pretty much the first in the company to really get an alert about that. We’re kind of that front line. Humorously, speaking of COVID, my mom… We’re all picking up interesting COVID habits, right? My mom, her thing to do that she’s doing right now is if she sees any of her friends post on Facebook or something that Netflix is down, she texts me, “Is Netflix down?” She’s been helping the team make sure that Netflix is up and available. But I think, we all, at Netflix, kind of felt the pressure of that when the first kind of lockdowns were in place, because people were relying on us to bring them that moment of joy in an otherwise very stressful and very rough situation, being cooped up, and a lot of uncertainty. We saw a lot of people turning to shows that are their favorite kind of just escapist type things. So we felt that that pressure to make sure that the service was stable for all of those customers, and so as a CORE team, one of the interesting things that we did really early on is we actually treated COVID, the situation, as an incident, so we have an inc ticket for it. That was a long running incident. I mean, can you imagine? I’m trying to remember, we kept that incident open for, I want to say six weeks. We treated it as a full blown incident, the way we treat any other incident for six weeks. We had daily meetings on what was your kind of… I think, in the calendar they were called it now, your moment of COVID-19, because I put that on the calendar. 
But yeah, we got together and looked at what were our challenges for the day around this situation and this kind of unprecedented time in our history, and both for world history and for Netflix as well.
Mandi Walls: Yeah. A long running incident like that, and that’s crazy, but it turns into, this is now going to be normal for a long time. What kind of challenges are you seeing converting from the thought, “Well, this is an incident and this is going to end,” to, “This is potentially permanent.” What does that look like?
J. Paul Reed: Yeah. Yeah. What we were looking at real early on, is the, are we able to serve streams to our customers, right? Are we able to provide those, we call them moments of joy, right? Which again, really needed in that time. A lot of the initial focus was really on, what is our impact on the internet. Right? This was kind of publicly reported, especially in Europe. Some countries asked us to actually use less of the internet during this time, because so many people were streaming, so we had to do some work around that. There was a lot of coordination for teams across the company to make that happen, because we wanted to make sure that we weren’t harming the internet when people were using it for work stuff. All of that became a really critical part of the infrastructure. But so there was a lot of focus early on, on sort of technical challenges or technical things. One of the things that we looked at, as a thought exercise, a part of the incident, we went into, what do we think could happen? It was really an exercise that was focused around… It was a planning exercise, right? That whole idea that plans are worthless, but planning is super important. We went through and looked at a number of various scenarios that were actually driven by COVID. Right? They were things that we may have thought about in the past, but with the new context of COVID, we wanted to look at them again, so we did some scenario planning there. What we sort of started to realize though, as we got to your point, Mandi, we’re a month in, things are still locking down. Two months in, things are still locking down, and then three months in they’re starting to come back and now we’re four months in, and then now they’re locking down again. Right? One of the things that we saw about month two, is we started to shift our focus less, and I should say this from a CORE perspective. What we’re thinking about in this incident that we’re talking about. 
We started to shift our perspective less around technology, and systems, and making sure they’re stable and all that, because we had some good evidence that things were going to be fine, right? At that point. We started looking more at the people impact, right? The impact of people on call, the impact of working from home, and there’s been a lot of people talking in our industry and throughout other industries, if you are working from home, what’s that like? We’ve all had the, if you remember that guy in the… What was it? The BBC, and-
Mandi Walls: Yeah. The Korea reporter, yeah.
J. Paul Reed: Yeah, Korea. Yeah. Where the daughter comes in. Right? We’ve all had that moment with coworkers, and it’s funny, right? For me, because I had a roommate for 10 years and he moved to take a job in LA right before COVID, so I’m actually in the apartment all alone now. It’s so funny, because my coworkers are like, “Oh, being alone must be great.” It’s like, “No, it’s like just grass is always greener.” Right? But we started really looking at, what are the… From an operations perspective, right? You may have an operations team. If that operations team was understaffed or underwater, when COVID happened, now you’ve got a whole other set of problems to think about with that. We really just started kind of looking at… All right, you’ll hear me say a lot, the word socio-technical systems, because that socio part is about the people in the system, and the people responsible for keeping it running, and the operators, the ops folks. We started looking less technical, more socio. That didn’t mean we ignored the technical part, but we started looking, all right, what’s the impact for folks, people?
Julie Gunderson: Let’s talk a bit more about that impact. I mean, just surface level impact. One of the things that I think a lot of us have seen, that I’ve heard, is that it seems like we’re doing so much more work now. Right? It seems like people expect that infinite capacity, because you’re at home, everybody’s in front of their computers. We know that people aren’t necessarily taking vacations as much as they should. Although PagerDuty is strongly encouraging us to make sure we’re taking even staycations, but Zoom meetings…
Mandi Walls: I never want to hear the word staycations ever again. Just that.
Julie Gunderson: I know. Let’s let J. Paul go ahead and close out this incident, so we never have to hear it, but what are some of those impacts that you’re seeing?
J. Paul Reed: One of the things that you said, that was really interesting, and we’ve been talking about this on CORE and parsing through what we’re seeing on other teams. One of the things that I should mention, is the way Netflix does kind of service ownership, is it’s the standard kind of, you write a service, you’re on-call for it. If you’re on an engineering team at Netflix, you are likely going to be on an on-call rotation for your team. So the CORE team will page you into an incident. We use PagerDuty for that. One of the interesting things is that means we have a really large dataset of what people are experiencing, what we’re seeing with paging rotations and that sort of thing. We’ve been starting to kind of parse through that. We also have started kind of, we actually have a kind of monthly kind of socio-technical systemic risk meeting. We’ve started actually talking about the impacts of working from home. One of the things you brought up though, that I thought was really interesting, is you said, “It feels like we’re all working more,” right? As soon as you said that I was amused, because we’ve all heard this, I’m sure, “Time has no meaning.” Right? It’s like, are we working more? Does it just feel like we’re working more? Like time has no meaning. Do you remember that news studio, or the local news broadcaster that they had the weather guy, just like, “What day is it today?” He would do this little… It was like, “It is Tuesday.” Right? But we have noticed that, very interesting, similarly to technical systems, there’s a difference between capacity and availability, right? A system can be highly available, but have reduced capacity. Well, people are the same way, and what we’re finding, and I think this certainly resonated with me when we started talking about it, is that we may be highly available or as available as people expect us to be. 
We might be eight hours online in our home office, or whatever the case may be, but people’s capacity is reduced during this, because of the stress of COVID. Then when, the Black Lives Matter protests, and all of that was happening on the national stage. That was a huge capacity impact. One of the things that we’ve started to talk about is parse apart that distinction between availability, I might be available, by the way, I might not, I might have to go take care of my kids, so maybe I can only work half time. I know some folks are impacted by that. That’s an availability question, but there’s also something to that around, “I may be available, but my capacity may be reduced,” and so teams really should not conflate those two. They’re not the same thing. That’s, I think, important for realizing where we might need to give our team members a little bit of grace, and where we can help them be as effective as they can be given the constraints that we all now have.
Mandi Walls: Yeah. Along with that, do you feel like that’s going to change the way people build their teams in the future? I mean, we’re kind of assuming that the pandemic’s not going to last forever, but the pundits are convinced that working in an office is over for a lot of people. That a lot of the structure of the way folks do their work is going to change. How does that roll into your day-to-day operations? Do you feel like we’re going to need more people working in operations to deal with these availability and capacity issues?
J. Paul Reed: That’s actually a really interesting question. I think one of the biggest changes that we’ll see with the remote stuff, and you see the large companies, Facebook, and I mean, even there’s discussions in Netflix, right? Around, do you need to be on-site? Do you need to be in the Bay Area? Can we do remote? Is there a value in doing remote? From a purely tech perspective, the longer that COVID goes, I think you’re going to start to see organizations that traditionally were like, “No, you have to be on-site.” They may return to that, and I want to be very clear, there are discussions in Netflix about it, but because we can’t predict the future, there are no conclusions, but I think there are discussions around it, and so you might see more sort of distributed ops teams. That is one of the things, interestingly, that we have seen as a challenge in some of the incidents that we’ve seen during this period, around coordination and work coordination, when you’re responding to an incident, right?
Mandi Walls: Yeah, sure.
J. Paul Reed: The cost of that has increased, and there can be, if you’re not used to that way of managing an incident, it can be pretty difficult. Now, one of the things that we, as a CORE team at Netflix do, is we actually manage all of our incidents on Slack. Then when we were in the office, we actually had an office or a conference room that was our kind of battle command for incidents if we needed it. What’s interesting though, is because we practiced doing incident management on Slack, as much as possible. I mean, that was our de facto way of doing it. We have been able to sort of pivot pretty easily to having everybody just be on Slack from home, but there are teams… What we found though, is a lot of times we might manage the incident on Slack, but other teams that got paged in, they would huddle up in the office together. There might only be one person on call, but they would huddle up and then work the problem together. They’re having to find different ways to coordinate, that works for them. For some of them, that’s a Hangout. For some of them, they’re able to kind of reason about it in their own Slack channel, but it still has a different tone and tenor to it than when you’re all literally sitting in the bullpen of your team at the office. That’s one of the things that we actually see as an impact there. If we look at, everybody’s going to lean into doing remote, I think there’s going to be some challenges there from an incident response perspective, because I wouldn’t necessarily say it increases the cost of coordination. It probably does, if you actually look at it, but it certainly changes what you need to think about as you’re coordinating. If you don’t have a lot of practice doing that, that’s where I think you’re going to see more people trying to solve that problem, and drilling on that and doing that kind of work with their teams.
Julie Gunderson: You know, one thing that we talk about at PagerDuty a lot is the incident commander role, right? Kind of bringing all those folks together in that coordinated response, but being an incident commander is stressful. COVID is stressful. Incidents are stressful. Working from home, layers of stress upon stress. What are ways that maybe we can work to reduce this? Or why is this also stressful still? Aside from COVID, we know that.
J. Paul Reed: Yeah, yeah, yeah. That’s a good question. I’m going to throw the question back at both of you, because I always find the answer interesting. When you’re doing incident commander stuff, or you’ve been involved in an incident. Why is it stressful to you? Have you ever delved to think… I mean, because what you said is totally accurate, it is stressful, but why?
Julie Gunderson: For me, I think it’s because you’re constantly thinking about that impact to the customer, right? What is that customer experiencing? If I’m experiencing this stress, can you imagine what it’s like for them? You’re thinking about that, and then also, there’s the thinking about, what’s the impact to the business? There’s all of that. Everything’s coming down and it just seems like it’s the most important, well, it is, the most important thing at that moment, and it’s that need to get it resolved quickly. I mean, I wonder, can you remove that stress outside of practicing, and having defined roles, and defined rules of engagement? I mean, how else can you make it easier?
J. Paul Reed: Yeah. Yeah. We’ll talk about that, but Mandi, why have you found it stressful?
Mandi Walls: Yeah. For a lot of the same reasons, being pulled in a million different directions, and especially if you… I mean, I will confess, the last time I was in a major incident, was probably before most of this stuff got codified as a way of working. Working with teams that don’t necessarily have the muscles or the muscle memory to sort of flex the right roles, and who’s going to talk to the business, and who’s going to talk to the customers and have all that stuff sort of laid out. A lot of that stuff rolls into the person running, whatever the incident happens to be. That was my experience with it. It was definitely the being torn in a million different directions, but also trying to solve the problem at the same time, and having a lot of different masters to answer to for the way all that stuff goes. That’s totally the way you’re living your life right now. You’re trying to work and if your kids are at home, you have all that stuff going on. Plus, I freak out every time someone rings my doorbell, I grab a mask and my sanitizer, I go to the door. That additional layer of craziness is just wild.
J. Paul Reed: Yeah. This comes from work by Jabe Bloom. He’s over in Red Hat’s Global Transformation Office, but he’s doing his PhD in design. What he’s looking at is actually long-term design. How do you design artifacts, like a city, that are supposed to last 300 years or 500 years. Right? How do you think on those timescales? One of the things he’s looked into is how do humans perceive time. One of the other things that happens is we talk with each other, and so much of our daily lives and our work depend on stories, right? The example he gives, and I really like this one, is if we go to a coffee shop and we go up to the barista, we expect a story based on all the stories that we’ve heard about that interaction and what’s going to happen. Right? That is a way that we can basically reduce our cognitive load. You can see this. If I said, “Hey, I’d like a cup of coffee,” and they say, “Would you like fries with that?” We’re going to go away, that’s not the story. What you’re doing doesn’t… The story doesn’t make sense. Right? Why is all this relevant? In an incident, we have told our customers a story. We’ve told our leaders, our business leaders and business partners, a story. An incident in a complex socio-technical system that we’re talking about is an instance where the story has broken down. The promise that I’d made you, that’s implicit in the story about the service will work, is no longer true. What has basically happened is that, then if you scope into the incident, the stories that we engineering teams told each other turned out to not be true, or the story that AWS told us turns out to not be true. The reason that it’s stressful, is because all of the inferences that we made about the future, and the stories that basically reduce the cognitive load for us, are not true, which means we now have to pay attention in the moment, and the bandwidth to do that on our brain is incredibly high. 
We have to pay attention to every little detail, because we can’t rely on the stories that were told to us about these systems anymore. We have to pay attention. Right? Okay. “What? Does [inaudible 00:20:49] actually do that? Wait, that’s not right.” That overhead of our brains is super, super heavy, so that’s why it creates stress. Now, one of the things at Netflix that we’re really thinking a lot about is okay, and this has impacted every team in the same way, is we had to basically reform the stories, all the stories that were told were about society and the economy, and all of those things broke down because of COVID. That’s one of the reasons it’s not only stressful at work, it’s stressful at home and life, because you’ve got these impacts that… There’s the impact of now your kids are always with you, and you have to tend to them, or you’re worried about them, but also, all of the other stuff that you’re talking about, Mandi. I love the example you gave, where you ring the doorbell. The story about what’s going to happen when you ring the doorbell has changed. Right? When you go back and look forward thinking about what does this look like in operations, the reason, and one of the things that I’m a little worried about is teams are going to have to come back together, and they’re going to have to figure out what the story is. Then they’re going to have to share all those stories with all of their partners, whether it be their customers, or their business leaders and all of that stuff. The story is very incoherent now, and we have to make it coherent again. You were asking like, “Okay, well, in an incident, how can you reduce that stress?” One of the things… and you were talking to Mandi about muscle memory. One of the skills that a really good incident commander has, is they’re able to shift through time. 
What I mean by that is they’re able to look at the story of two weeks, four weeks, six weeks of the promises we made about the system they’re working the incident on, and say, “Okay, how is that story different from the story I see in the next five minutes, or the story I see in 10 minutes?” What you find really fascinating is when incidents start to become less stressful, or you could ask, when do incidents start to become less stressful? They start to become less stressful when you say, “Oh, I think that’s it, and I think if I fix this, it’ll fix in 10 minutes.” It’s basically when you are able to tell a story that is coherent with everyone and get back to this long form, six, eight weeks, six months story. Right? That’s one of the things, I think, that we’re just starting to emerge. I see this in a lot of different places, lots of friction between people, because they haven’t realized their stories that they told each other aren’t actually there anymore, and they’re not coherent if they are yet. There’s going to be a period, I think in the next… When we talk and we talked about working remote, that’s going to add to it, right? I think you’re going to see these sort of when you want to say systemic stressors, right. That’s like working from home and those things. Those will probably start to ease up. We’re highly adaptable as a species, so we’ll get used to that, but there’s going to be… And I see it on teams that I have friends on, you’ve probably seen it on your team, we’ve seen it on our team. We had an off-site, a team off-site to figure out what we’re going to do for 2020, the last week of January. You’re laughing at that. Right? All of those things that we came up with are not a thing anymore. We’re going to have to, as a team, revisit that, and it’s going to have to be a conscious deliberate activity, and teams should be prepared to do that. I think that will actually start to help construct a new normal, if you will. One last thing I’ll add to that. 
We knew that we couldn’t keep this incident going, as it became clear that COVID was going to be a thing for a while. We knew that we couldn’t keep the incident open, so we started talking about what are the exit criteria, and how do we want to coherently think around the incident, and what we basically came to is, we wanted to help Netflix navigate the increased requirement for teams to become adaptable. Right? There were a lot of patterns that we relied on, that started shifting because of COVID, and so the incident was really about, okay, we need to step carefully and lightly because we’re learning how to be increasingly adaptive in this new environment. For some folks, and some teams, the requirement to have adaptive… We actually call it adaptive capacity. Right? The requirement for that adaptive capacity actually has gone down, because they’ve figured it out, but for other teams, they’re still having to be adaptive and innovative in the way that they do work, but they know that now, so they know what they need to do to keep that adaptive capacity level high. That’s how we decided, okay, that with a couple of other extra criteria, we were able to say, “Okay, we feel good about the state of risk with respect to COVID in front of the Netflix socio-technical system.”
Mandi Walls: Yeah. So many things to keep top of mind as stuff changes. All right. On the podcast we have two questions, we ask everyone, so I’ll ask the first one, what’s one thing you wish you would have known sooner when it comes to running software in production?
J. Paul Reed: What’s interesting is, as I have focused more on post incidents and incidents in operations, it’s the people, man. It’s people all the way down. Right? A lot of times it’s so interesting to me that we talk about working in complex systems, but then when it comes to incidents, people are like, “Yeah, yeah, it’s a complex system, but it’s the dominoes, right? If I stop the dominoes or don’t tip the dominoes over, then we won’t have the outage.” Right? There’s this very linear way of thinking about stuff, and the dominoes don’t account for the people. Right? It’s just that idea, if you are on a team, as an operations person, or a developer, or a business person, right? The people operating in your system are doing way more work than is visible to you. That goes across even the operations teams, your teammates, we do stuff constantly to keep the system up and running, and we don’t really even think about it that way. Yeah. I wish I’d known that in my career. I mean, I only really leaned into that, I don’t know, five years ago, and I’ve been doing this for 20 years. I wish I would’ve learned that. It’s funny. If you can imagine, I was a bit of an asshole in college. I would have been less of one if I had known that there were actually people and it’s not just scripts and computers and stuff running.
Mandi Walls: I’d never believe that. Never in a million years. Our second question, is there anything about running software that you’re glad we didn’t ask you about?
J. Paul Reed: That’s a good question, right? The answer might be, why is it people all the way down? I’m kind of glad we didn’t really delve into that. Right? Some of the secrets I’ve learned in doing retros with teams, I’m glad you didn’t ask me about some of those, because they’re fascinating, but they’re not things that… I’ll write a tell-all book for the industry in 30 years.
Mandi Walls: Yeah. Someday man, someday.
J. Paul Reed: By the way, you know what’s funny about that is I know both of you. I know we all have those stories, right? It should be like an anthology of…
Julie Gunderson: I’ll buy it. I’ll buy the book.
J. Paul Reed: Incident stuff.
Mandi Walls: Do one every year, right? The year’s best horror stories in production.
J. Paul Reed: I can’t remember somebody on Twitter was saying that they wanted to do… Was it Fail Con, or something? They wanted to do a conference that was just like, you get onstage and there’s no recordings, and it’s you just actually give the real dirty laundry. I would go. I’d be there in a second.
Mandi Walls: Yes.
J. Paul Reed: After COVID though.
Julie Gunderson: Yes. After everybody puts their phone in that basket, in the back of the room.
J. Paul Reed: Yes. Those little lock things. The little pouches. I went to a… Who was it? It was a comedian. Oh, it was Amy… she has a bunch of Netflix shows. What’s her name? Amy Wong, I think. Anyway, we all had to put our phones in the little bags and I’d never done that. I was like, “Oh, that’s interesting.” Yeah. We’ll do that. Next, 2021.
Julie Gunderson: There you go.
J. Paul Reed: Fingers crossed.
Julie Gunderson: Well, J. Paul, I mean, I just want to say thank you so much for being on the show. If you had just one sentence of wisdom you could tell people, to get through this time, or your favorite Netflix show. You can do that as well.
J. Paul Reed: I can do both. It took me a second to come up with one, but Dominica DeGrandis is great. I love her. She’s awesome. She used to talk a lot about, when she was talking about Kanban, of respecting reality. The bit of wisdom I would say right now, is we need to respect reality, and as a part of that, we need to give each other a little bit of grace, and we need to give ourselves a little bit of grace. We need to keep remembering that until we’re through this, because it’s really easy to forget it. That’s the word of wisdom I would say there. Favorite Netflix show right now? Ooh. Ah, oh, oh, oh, okay. I just found out and I’ve been tweeting about this. I just found out that they have put Supermarket Sweep from 1993 on the service. I have been watching. I’ve been savoring it, just a couple of episodes a night, but it is so… because I used to get a snack every afternoon when I got home from school and I’d watch Supermarket Sweep, which that should have been a signal for a lot of things, let me tell you, back in the day. But, love that show. Then, when I’m not watching that I’ve been watching Dating Around. Oh, and I just finished Space Force too. They’re all great. I don’t generally like dating shows, but I actually really liked Dating Around, because the editing and the camera work is super interesting, so watch an episode of that, even if you don’t like dating shows.
Mandi Walls: We can take a pause to thank Netflix for the pandemic gift that was Tiger King as well.
Julie Gunderson: Oh yes.
J. Paul Reed: Oh yes, totally.
Julie Gunderson: Absolutely.
Mandi Walls: It feels like it was five years ago. But… Yeah.
J. Paul Reed: By the way, listeners out there, I didn’t mention this. I’m @jpaulreed on Twitter. If there are other shows that you want to put on the request list along with Messiah, you can tweet me. I’ll see what I can do.
Julie Gunderson: All right.
J. Paul Reed: I’ll schedule a one-on-one with Ted Sarandos.
Julie Gunderson: Oh, that is fantastic. Thank you. Just to let you know what I’ve been watching is Mindhunter and The Good Place. The Good Place, I have now watched it three times, because… Yeah, I mean, you just have to. Mandi, what’s yours?
Mandi Walls: My brainless rewatch right now is… I’m about to kick off a 21 season rewatch of Midsomer Murders. If you’re looking for some good white-on-white crime in Britain, it’s 20 years of craziness there. That’s a fantastic one.
J. Paul Reed: I’m not rewatching it, but this whole cake meme thing? I did rewatch the one Star Trek: The Next Generation episode, where Deanna Troi is a cake in Data’s dream, because I was like, “I want to see the cake meme,” but they did it in the ’80s or I don’t know, ’90s I guess. Yeah, that was my other guilty pleasure watch.
Mandi Walls: Awesome. Thanks so much for being with us today. That’ll wrap up our show for this episode. I’m Mandi Walls.
Julie Gunderson: I’m Julie Gunderson.
J. Paul Reed: I’m J. Paul Reed.
Mandi Walls: And we’re wishing you an uneventful day.
Julie Gunderson: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter @pageit2thelimit, using the number two. That’s @pageit2thelimit. Let us know what you think of this show. Thank you so much for joining us, and remember, uneventful days are beautiful days.
J. Paul talks to us about being on the critical operations team at Netflix, what that has been like during the quarantine, and the pressure they all felt at Netflix to make sure the service is stable for their customers.
J. Paul: “What we were looking at real early on, is are we able to serve streams to our customers, are we able to provide those moments of joy?”
J. Paul continues to discuss the impact and coordination required to overcome technical challenges, and how looking at COVID as an incident helped with planning exercises.
J. Paul: “We started to shift our perspective, less around technology and systems and making sure they’re stable and all that, because we had some good evidence that things were going to be fine right at that point and we started looking more at the people impact.”
The conversation shifts to the impact of people being on-call and being required to work from home.
J. Paul: “We started to really look at, from the operations perspective, if that operations team was understaffed and underwater when COVID happened, now you’ve got a whole other set of problems to think about with that.”
He continues to talk about socio-technical thinking - how the socio part is really about the people in the system who are responsible for getting systems up and running and operating them.
J. Paul brings up the levels of impact to the people, beyond just the surface-level impacts of being at home.
J. Paul: “If you’re on an engineering team you are likely going to be on an on-call rotation for your team. So the core team will page you into an incident, where we use PagerDuty for that. And so one of the interesting things is that means we have a really large data set of what people are experiencing or what we’re seeing with paging rotations and that sort of thing. So we’ve been starting to parse through that. We actually have a monthly kind of socio-technical systemic risk meeting, so we’ve started actually talking about the impacts of working from home.”
J. Paul moves on to discuss the difference between capacity and availability, and how that distinction applies to people just as it does to systems.
J. Paul: “We may be highly available or as available as people expect us to be, so we might be eight hours, you know, online in our home office or whatever the case may be. But people’s capacity is reduced during this because of the stress of COVID.”
The conversation around availability vs. capacity continues and J. Paul encourages us to give our team members more grace.
Mandi and J. Paul talk about the biggest changes we see with remote conversations and the need to be onsite, as well as the value of being remote.
He then mentions ways distributed teams can increase the cost of managing incidents and how they combat this at Netflix by practicing and doing incident management on Slack. J. Paul continues to discuss the ways folks are changing the way they work due to a lack of in-person meetings.
The conversation moves to a discussion around how humans think through stories and why stress levels are higher during incidents.
J. Paul introduces us to Jabe Bloom’s (@cyetain) research at IBM’s Red Hat Global Transformation Office, and how humans process through stories that make sense. He explains that incidents are the moments when those “stories” have broken down.
J. Paul: “And the reason that it’s stressful is because all of the inferences that we made about the future and the stories that basically reduce the cognitive load for us are not true, which means we have to pay attention in the moment. And the bandwidth to do that on our brain is incredibly high. We have to pay attention to every little detail because we can’t rely on the stories that were told to us about these systems anymore.”
J. Paul explains that Netflix couldn’t keep the COVID incident open forever, and how they needed to learn and become increasingly adaptive in the new environment.
J. Paul: “The requirement for that adaptive capacity has actually gone down, right, because they figured it out, but for other teams they’re still having to be adaptive and innovative in the way that they do work, but they know that now, so they know what they need to do to keep that adaptive capacity level.”
Just a reminder: if there is a series you want to beg Netflix to bring back, J. Paul offered you the ability to tweet him @jpaulreed with your requests.
J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful consulting firm, he now spends his days as a Senior Applied Resilience Engineer on Netflix’s Critical Operations & Reliability Engineering (CORE) team, focusing on incident analysis, systemic risk identification and mitigation, applied Resilience Engineering, and human factors expressed in the streaming leader’s various sociotechnical systems.
Reed is an internationally recognized speaker on operational sociotechnical complexity challenges and opportunities, Resilience Engineering, and DevOps, and holds a Master of Science in Human Factors & Systems Safety from Lund University.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.
Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.