J. Paul talks to us about being on the critical operations team at Netflix, what that has been like during the quarantine, and the pressure the whole team felt to keep the service stable for customers.
J. Paul: “What we were looking at real early on, is are we able to serve streams to our customers, are we able to provide those moments of joy?”
J. Paul continues to discuss the impact and coordination required to overcome technical challenges, and how looking at COVID as an incident helped with planning exercises.
J. Paul: “We started to shift our perspective, less around technology and systems and making sure they’re stable and all that, because we had some good evidence that things were going to be fine right at that point and we started looking more at the people impact.”
The conversation shifts to the impact on people of being on call and being required to work from home.
J. Paul: “We started to really look at, from the operations perspective, if that operations team was understaffed and underwater when COVID happened, now you’ve got a whole other set of problems to think about with that.”
He continues to talk about socio-technical thinking: the socio part is really about the people in the system who are responsible for getting systems up and running and operating them.
J. Paul brings up the deeper levels of impact on people, beyond just the surface-level impacts of being at home.
J. Paul: “If you’re on an engineering team you are likely going to be on an on-call rotation for your team. So the core team will page you into an incident, where we use PagerDuty for that. And so one of the interesting things is that means we have a really large data set of what people are experiencing or what we’re seeing with paging rotations and that sort of thing. So we’ve been starting to parse through that. We actually have a monthly kind of socio-technical systemic risk meeting, so we’ve started actually talking about the impacts of working from home.”
J. Paul moves on to discuss the difference between capacity and availability, and how that distinction applies to people just as it does to systems.
J. Paul: “We may be highly available or as available as people expect us to be, so we might be eight hours, you know, online in our home office or whatever the case may be. But people’s capacity is reduced during this because of the stress of COVID.”
The conversation around availability vs. capacity continues and J. Paul encourages us to give our team members more grace.
Mandi and J. Paul talk about the biggest changes we see with remote conversations and the need to be onsite, as well as the value of being remote.
J. Paul then describes how distributed teams can increase the cost of managing incidents, and how Netflix combats this by practicing and running incident management on Slack. He goes on to discuss how people are changing the way they work in the absence of in-person meetings.
The conversation moves to a discussion around how humans think through stories and why stress levels are higher during incidents.
J. Paul introduces us to Jabe Bloom’s (@cyetain) research at IBM’s Red Hat Global Transformation Office on how humans make sense of the world through stories. He explains that incidents are the moments when those “stories” have broken down.
J. Paul: “And the reason that it’s stressful is because all of the inferences that we made about the future and the stories that basically reduce the cognitive load for us are not true, which means we have to pay attention in the moment. And the bandwidth to do that on our brain is incredibly high. We have to pay attention to every little detail because we can’t rely on the stories that were told to us about these systems anymore.”
J. Paul explains that Netflix couldn’t keep the COVID incident open forever, and that teams needed to learn and become increasingly adaptive in the new environment.
J. Paul: “The requirement for that adaptive capacity has actually gone down right, because they figured out, but for other team they’re still having to be adaptive and innovative in the way that they do work, but they know that now, so they know what they need to do to keep that adaptive capacity level.”
Just a reminder: if there is a series you want to beg Netflix to bring back, J. Paul invited you to tweet him @jpaulreed with your requests.
J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful consulting firm, he now spends his days as a Senior Applied Resilience Engineer on Netflix’s Critical Operations & Reliability Engineering (CORE) team, focusing on incident analysis, systemic risk identification and mitigation, applied Resilience Engineering, and human factors expressed in the streaming leader’s various sociotechnical systems.
Reed is an internationally recognized speaker on operational sociotechnical complexity challenges and opportunities, Resilience Engineering, and DevOps, and holds a Master of Science in Human Factors & Systems Safety from Lund University.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.
Mandi Walls is a DevOps Advocate at PagerDuty, where she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.