Incidents, Response, and the People With Tim Nicholas

Posted on Tuesday, Jul 20, 2021
Tim Nicholas, Principal Engineer for SRE at Xero, talks to Julie Gunderson about the importance of using incidents as learning opportunities.

Transcript

Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. I am really excited to be joined today by Tim Nicholas from Xero. And you can find him @TimNicholas on Twitter. And we’re going to talk about incident response and learnings, and really importantly, the culture required to be successful when we’ve got these systems that are so complex. And that’s one of the important things: our systems are complex. How do we learn from them? How do we take our incidents and move to that next stage? So, Tim, did you want to introduce yourself real quick?

Tim Nicholas: Sure. Thanks Julie. So, yeah, I’m a principal engineer in the SRE team at Xero based here in Wellington, New Zealand. And I really try and spend as much of my time as I can focused on the incident learning area. And so making sure that we have the skills and processes and support that we need to be able to learn when things go wrong and ideally to learn before things go wrong.

Julie Gunderson: Let’s talk about that. We have incidents, they’re a thing, they’re unavoidable. Why is it important to learn from them?

Tim Nicholas: Because they’re such a good insight into how the system is working the rest of the time. So we can fix incidents without learning very deeply. That’s entirely possible. It’s what a lot of people do. And I think it’s the natural thing to do. But by being more deliberate about learning from them as well, we learn a whole bunch about the rest of our system and how we do work, and what the challenges and constraints and trade-offs are that shape the rest of our work lives. And so that’s the opportunity that incidents give us: a situation where we expected one thing to happen and something else happened instead. So we’ve got a disconnect between our expectations and reality, and that’s fundamentally a really great opportunity to learn from.

Julie Gunderson: One of the things that we like to say is incidents are a gift, right? They are a way for our systems to talk to us. And we’ve definitely seen organizations that see incidents as failure, but when you learn to embrace that, failure isn’t something that has to be a terrible thing within your organization if you can learn from it. But that also does require the right culture. You spend a lot of time talking about this. We’ve talked about the culture required to learn from incidents before. What do you think the most important thing is there?

Tim Nicholas: Certainly the most important thing is the safety of doing the learning. So if you’ve got a culture where learning more about the reality of doing that work is not something that people feel safe to do, shining a light on perhaps some trade-off that’s required or some pressure that’s on a team or on people doing work at any level, if it’s not safe to look at those things and not safe to talk about them, then it’s very, very hard to learn. And it’s very hard to make any difference to those pressures that are on people doing work. So that’s definitely the most important aspect of culture, but it’s also something that learning helps build. One of the realities is that progressing to the point of being able to do this learning more and more productively requires some safety, but it also helps build that safety by creating knowledge about those pressures on each other. It makes it easier to understand and accept that something has gone wrong somewhere else in the organization. And over time, that can snowball in a positive way towards safety. It’s not a very robust snowball, because you could have someone come in and make it unsafe, undermine it very quickly. But nonetheless, safety is the most important thing, and it doesn’t have to be complete before you can start; it’s something that develops and is reinforced as you go along.

Julie Gunderson: Let’s talk about that, because we talk a lot about psychological safety. We talk about blameless postmortems. But let’s say that we have folks that are in an organization that might be new to this. What are some of the things that you can do to create that safe environment?

Tim Nicholas: Making sure that there’s enough time for these things to happen is really important, because they’re easy to rush. It’s really easy to think that they have to get to a conclusion really quickly, and that there have to be really concrete outputs, and all of those things. And so making space for people to be curious, to ask questions, to reflect on their activities matters. And that reflection could happen in private rather than in public, I think; it can happen in a lot of different ways, but finding ways to support it is important. And I think time is critical; valuing the reflection piece is really critical. So if an organization can be explicit about making some space for this thing and saying, “This isn’t about certain things, this is about learning,” then that can change what those conversations are. But there’s also a challenge there in terms of being able to value that learning, being able to see it as something that’s real and important rather than just, well, we should be fixing the thing. We know that this thing’s broken, we should be spending this time over there. And when we’re pressed for time, pressed for capacity, that can feel like a hard argument to make. And that’s where I think just saying, “Look, we need to carve out this time, we need to do this learning, and then we can go and do those other things and it’s fine,” is totally worthwhile.

Julie Gunderson: Yeah. And I think also making sure that you share those learnings widely through the organization, so that people understand that it’s okay, that we have learned from it, and that sharing it is supported by leadership and by our peers, especially with how complex our systems are getting. And at the end of the day, complexity isn’t going away. You can’t remove complexity anymore. Systems are so large and so complex that people will find ways to break them in new and unique and very creative ways that we would never have imagined before. So I think that when organizations are always on the hunt for that root cause, that also contributes to the lack of safety, because as engineers, holding the mental model of those systems in your head and looking at all the ways everything interacts, it’s just not possible anymore.

Tim Nicholas: No, absolutely. It’s definitely not. And I don’t think it has been for a really long time. The thing about root cause that contributes to that lack of safety is that it’s arbitrary which thing gets chosen as the root cause, if you feel like that’s a useful concept. If people focus on finding that root cause, then any particular activity might get pointed to as being the root cause, right? At any layer in the system, someone might decide that that’s the point where they’re going to stop and say, “Oh, well, there’s the cause,” and not continue beyond that to understand how that became the case, what contributed to that reality. So that’s the danger of the concept of root cause. But I’ll just go back to an earlier point that you made about the importance of sharing those learnings out to the organization. It is really important to share learning out to the organization, no doubt. But in the early stages of learning how to learn and building that safety, maybe that’s not the priority. So one of the most important things early on, I think, is to make sure that the activities that we do are contributing meaningful value as soon as possible. And that means making sure that they’re really useful to the people who are participating in the activity. And so if that’s an incident review, then the first thing it has to do is be really valuable, a great learning activity and process for the people who are directly involved. Initially, how that is amplified out to the rest of the organization is a secondary concern. That’s sort of a bonus value you can get from it. It’s a really big bonus, but one of the ways of making the process safer is to say, “Well, let’s leave that as something that we want to do, that we aspire to, but we need to learn how to learn locally first. We need to build these skills, and then maybe we can add some different forms of communication out after that.” Because if we focus on that first, then we start thinking, how do I package this in a way that’s going to be able to go out to the organization? Or we only talk about things that we think are relevant to the organization. And then that, again, misses out on a bunch of learning and a bunch of curiosity about these different avenues that we can talk about when something has gone wrong, when some work has been complex, when coordination has been a challenge; perhaps the coordination has been a challenge just between members of the same team or the same department. And so it can seem like that’s not relevant to talk about for a bigger audience, and it may or may not be. Probably, actually, it is really interesting, because those will be communication patterns that are repeated. But focus first and foremost on how we make sure this is valuable to the people participating, and then, sure, can we get more value from this by communicating out, by writing it up in different ways? For me, it’s really important to have it in that order.

Julie Gunderson: I think that’s a fair statement too. One of the things that we talk about is, when writing the postmortems, having somebody who’s trained in blameless postmortems review them before they go out. So I think you’re right on the incremental learning steps, but you also hit on something there too, because it’s about more than what happened with the system. It’s also about understanding how the whole response process went, because you said communication patterns, these things will be repeated. And I think that sometimes that also gets lost within the incident review. Do you want to talk a little bit more about your recommendations for the things that we should be reviewing during that?

Tim Nicholas: Definitely. There’s a huge amount of value in looking at not just the way that the technical system behaved, but also at how the human system around it behaved. Because we are dealing with socio-technical systems, as they say in the fancy lingo. There are people and there are computers, and they are working together to try and achieve some outcome. And they’re fundamentally the same system in many ways. And so if we just focus on what the computers are doing, then we’re missing out on this huge piece that we can learn from, and that we really need to understand to be able to help it adapt and help it be more expert in the future. And the challenge there, I think, is that a lot of the time when we talk about blameless postmortems, if we’re not doing some really deliberate learning about what that means and how to think about human error and how we work in complex systems, then we can end up trying to be blameless by not talking about things. And so there’s this trap where I’m told I’m supposed to be blameless, but I haven’t got the skills to reframe or to think about what happened in this situation in a way that isn’t blameful, because blame is a really natural way to look at something going wrong that involved someone doing something. It’s the thing that’s easiest to do. And so we have to learn how not to do that, and it’s not just a matter of being told not to do it. One of the things that I have seen is teams being told they should do blameless postmortems. It sounds like a good idea. They don’t want to blame people, but they don’t know how not to. And so rather than that instruction being helpful, it ends up with them focusing really narrowly on the technical pieces, because they can talk about those without blaming anybody, but they have also necessarily missed out on a whole bunch of really important details about how people responded and how we came to be in the situation. So it’s necessary to learn about local rationality: what was the situation for that person making that decision that led to it making sense for them at the time? How do we understand what they were doing in that situation in a way that helps us understand the pressures on them, the trade-offs they were having to make, and what they thought the world was like at that moment, when it turns out later on that it was actually different from how they thought it was?

Julie Gunderson: J. Paul Reed has a lot to say on this. We’ve actually had some conversations with him on this podcast. And I think you and I have chatted about this before too. The Infinite Hows by John Allspaw is absolutely a worthwhile read, and we have an episode with him as well. So for those that are new to this, maybe check that out. And I heard you saying the how. How, how, right? And that’s a thing that we consistently need to think about: the way that we talk to each other and the words that we’re using. J. Paul Reed will say that blame is hardwired over thousands and thousands of years of evolutionary biology, which is true. And there are a lot of different types of blame. Is it hindsight 20/20? I would say that that’s a very common one. Or the bad apple theory, right? If we just kick Tim out of the organization and get rid of him, we will never have another incident. And that’s just not the way it is anymore. I think, though, that some people do get tripped up a little when we talk about blame versus accountability. How do you keep that accountability there while still maintaining that blameless environment? Do you have any thoughts there?

Tim Nicholas: Yeah, it’s really hard. And this is going to seem a little trite, because it did a little bit to me when I first heard about it, but Sidney Dekker talks about accountability, and focuses on the word accountability and how you can have multiple senses of it. So we can have accountability in the sense that we hold somebody to account and say, “You have to pay for this.” And that’s the thing we’re trying to avoid. An alternative way of looking at those same situations where we want accountability is to give people the opportunity and the stage and the support to tell their account, to actually be able to describe what the situation was for them that led to this making sense, and to have other people be able to get in there and understand what they were facing. And that’s really important even to someone who’s perhaps been harmed by something going wrong, because it allows them to see genuinely, A, that the person was not necessarily stupid or malicious, but also that, having those accounts, we’re able to see insights into the system. And so we’re able to see what other things could have been done, or could be done in the future, to prevent harm coming to somebody else. And so telling those stories is really, really important, but it’s only really meaningful when they are told to the people who were harmed or to people who are able to use that information to adapt things in the future. One of the big challenges of doing this kind of work is that it fundamentally exposes conflict and dysfunction. And so by looking at how incidents happen and asking, “Well, how did we get here? What were these pressures? What were the trade-offs that we were forced to make, or that we did make, not necessarily forced to make, any particular trade-off?” we end up looking at different layers of the organization and different ways that people function together, and how teams and departments and organizations collaborate and communicate what they need to each other, and how we distribute our resources. All of that stuff becomes pertinent as we start looking at how we got here. And it spans different timescales, but it’s also necessary for being able to be deliberate about some of those choices that we make. Because we might make a choice one year about how we resolve something, and then that has consequences two years further on, where suddenly we don’t have the expertise that we need for some new change in market conditions, let’s say, where we say, “Wow, actually we need to prioritize this other thing,” which now we’ve got to spin up and work out how to understand that domain, because we made some priority decisions a year earlier. And it’s not that we can necessarily anticipate what’s going to change, but by seeing how the system functions and by being really critical about that, we can see where we might be missing some capacity or running really close to the line, or at least be conscious of the trade-offs that we’re making and say, “Well, okay, cool. We can do this, but down the line we’re going to have to deal with the consequences of this, or we might have to deal with the consequences if something happens.” And exposing that is really important, I think, to making better decisions in the long term.

Julie Gunderson: Absolutely. And that’s why one of the things that we talk about, and folks who are listening can check out postmortems.pagerduty.com for how we look at it at PagerDuty and some of the resources that are there, is having the right people in that postmortem, that incident review meeting. So we have people from product and engineering and leadership there, because those after-action items do need to be prioritized. And do you have SLAs around that? That’s one of the things that I hear a lot: we have our postmortem and then nothing gets done with that information; the after-action items don’t get done. And that requires much more than just one team. Folks need to agree, right? There might be a new feature that has to be paused to make this part of the system more reliable. Or, as you mentioned, maybe the ROI of fixing this isn’t worth it right now, but a year down the road it might be. And so it’s also important to capture the learnings that we had and how we came to that conclusion now, so that in a year, in two years, we can go back and look at that and say, oh, these were the conditions, this is what happened. A global pandemic hit us and we needed to focus on making our product usable for everybody, and so we’ll tackle this down the road. And I think a lot of organizations are in that boat right now.

Tim Nicholas: Yeah, absolutely. We’ve talked about this once before, but when we get a whole bunch of action items out of a review, things that we suddenly want to change, it’s usually, if not in the heat of the moment of the incident itself, then pretty close to it. And often, if it’s in an incident review meeting that we’ve come up with these sorts of things, the plan that we’ve come up with in that meeting doesn’t involve a well-synthesized understanding of the system, the understanding that we’ve gained from the meeting. Lots of different perspectives will have been shared about how the system worked, about how we responded, about what the pressures were, and then immediately afterwards we come up with a plan and we say, “Well, we should do X, Y, and Z. And if we do that, this will never happen again and we can live happily ever after.” Which is really appealing, right? Because we all want to live happily ever after. We all want to not experience that pain again, and we want to be able to tell other people that we’re not going to experience that pain again. And so there’s a whole bunch of motivations for saying, “We’ve got a plan. It’s cool. Don’t worry about it.” But actually, often those things don’t get done, and they don’t get done because it’s not well understood why they’re a good idea, or because they’re not actually a good idea once we have some time to reflect on it. And maybe X, Y, and Z are all closing off the same thing. Quite often you can essentially duplicate things, where we say, “Well, we want to make the system more robust to this failure, and so we’re going to put in this checking, and that checking, and that checking,” but they’re all checking for the same thing. And they’re all perhaps too specific to the one scenario that happened this one time. And perhaps because we’re so focused on that particular event, they have not been generalized; we haven’t thought about whether, actually, maybe we are doing the wrong work, and that is what creates this scenario. And this is the challenge of the Swiss cheese model of accidents: it can look like, well, that’s all right, we’ll just put in another slice of cheese that doesn’t have a hole where this particular set of events lined up, and then it’s solved. But that’s not the reality, right? Actually, the system is constantly shifting and changing. And by solving the problem that we had last week without thinking about how the system is changing and adapting, how we’re having to change and adapt, and what the changing needs of our customers are, we’re just fooling ourselves and moving on. And so we have to look at the more complex system and understand why this is actually a dangerous system. This is something that can go wrong, because all of these systems are dangerous systems that can go wrong. They are going wrong all of the time in some way, but the system is preserved as a functional system that achieves what we want it to achieve because of all these people who are doing stuff. That’s how the system is maintained. That’s how the system keeps functioning and stays safe: by people doing things.

Julie Gunderson: Absolutely. And I love it. Now, because we’re almost out of time, I want to make sure we get to the two questions that we ask every guest on our show.

Tim Nicholas: The two questions.

Julie Gunderson: The two questions. So first, what’s the one thing you wish you would’ve known sooner when it comes to running software in production?

Tim Nicholas: I could have benefited a lot earlier from understanding local rationality. So I’m going to answer in terms of when things go wrong and dealing with it afterwards. I think it took me quite a while to really get to grips with, and I’m still getting to grips with, how to really think about other people’s scenario, other people’s situation, when something unpleasant happens, when something is surprising. And developing that skill, I think, is really important for people in modern organizations, where we’re collaborating with so many people that we maybe don’t have really close relationships with. It’s really different from when you’re working in a team with five or six people and they are most of the people that you deal with, or that are going to make decisions that you have to respond to. That’s a really different scenario from a modern organization that might have hundreds of people deploying software, making choices about architecture and implementation and response, all of which might result in some pain being felt by you and your team. And so working out how to think about other people’s experience and being able to inquire into it is something that would have stood me in good stead earlier.

Julie Gunderson: I love that. I actually haven’t heard that one before on that question. That’s a great answer. Now, is there anything about running software in production that you’re glad I did not ask you about today?

Tim Nicholas: I’m a crap programmer. And so anything about that. Let me think about… I’ve been so focused on other things that questions about actual implementation, I would have to be extremely careful about answering.

Julie Gunderson: Okay. We’ll save that for another time then. Well, thank you. And thank you for being here with us today. I always love having conversations with you. We’ve had quite a few in the past and just really have enjoyed our time together. So thank you.

Tim Nicholas: Been a pleasure.

Julie Gunderson: And hopefully we’ll have you back. And then this is Julie Gunderson signing off and wishing everyone an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @PageIt2theLimit using the number two. That’s @PageIt2theLimit. Let us know what you think of this show. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Show Notes

Additional Resources

Guests

Tim Nicholas (he/him)

Tim is Principal Engineer for SRE at Xero, a cloud-based business software company with over 2 million subscribers. Tim’s years creating and responding to “surprises” in production have fueled a passion for learning from incidents. He is an advocate for the people grappling with complexity in high pressure circumstances.

Tim has spent over 20 years working with technology and infrastructure at scale. He cut his teeth building and supporting large scale network, data and compute systems for the VFX and animation industries. In 2014 he moved to the SaaS world with Xero where he has worked across a number of disciplines as an engineer, team lead, product owner and architect - each adding to his perspective on software systems engineering and operations.

Hosts

Julie Gunderson

Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.