Prioritizing Post-Incident Work With J. Paul Reed

Posted on Tuesday, Oct 3, 2023
J. Paul Reed returns to the show to talk more about a series of posts he published on Medium discussing how likely post-incident action items are to be completed.


Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve the system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me at LNXCHK on Twitter. Welcome back to Page It to the Limit. I’m Mandi Walls, I’m back with J. Paul Reed. It’s been a few years since you’ve been on the show with us, so welcome back. What are you doing these days, man?

J. Paul Reed: After Netflix, I did a bunch of consulting. I went back to consulting, actually. If folks remember, I was doing that before Netflix and then at Netflix they were like, “We would like you to do that full time, but for us.” And so I’m doing that again after a bit of a break. I don’t know if folks know this. I was on the core team at Netflix, which the way we describe that is it was the team that holds the pager for Netflix, and I was on it right when the pandemic started and through the pandemic. And so that was a very wild ride. It was fun, I really enjoyed working with all of my teammates through that period. But as you can imagine, in those first couple months we were all on that couch with popcorn watching Netflix. And it’s weird, our team got calls from the Italian government at the time saying, “You’re using too much bandwidth, please use less.” We were involved a little bit. That was more sort of the networking team and the CDN team fixing that. But my point is, the pandemic brought a bunch of these weird problems that you wouldn’t think. It’s like the hospitals need the internet, show the little slightly degraded video version of that, so we can-

Mandi Walls: You don’t need the high-def version of Tiger King or whatever for everybody who’s sitting at home.

J. Paul Reed: Yeah, the Tigers look real enough.

Mandi Walls: Oh yeah, absolutely. Anyway, you put out a series of articles which went around in our internal Slack at PagerDuty about the last frontier of incidents, and you have an incident. You know what happened. You have the post-incident review and then the dark cloud descends on the team, because there’s action items.

J. Paul Reed: Yeah, they’re the other AI that is scary.

Mandi Walls: The other AI, yeah, yeah, the other AI. The AIs that will never get done. We want to talk about that today, because you put this three-part series out and I’m like, “Oh yeah, that’s so true. That’s so true.” Because we see this all the time. You go through the motions of action items and you come out the other end with this list of things to do and then it disappears into the wind.

J. Paul Reed: Right. Right, right, right. This series of posts is really funny. It turned out to be a series of posts, but originally the idea was just like, “I’m going to write a post about this idea of a spectrum.” But then as I was starting to write, it’s like, “Okay, we’ll just set it up, you need to make the complete argument.” And so the first post is really about kind of a spectrum, but it has this idea of benefit into it. And so it looks very much like a consultant. Again, I was doing consulting, it’s the consultant quadrant. And so you have this idea that on the horizontal, this idea of cost, so you’ve got low cost, high cost. And then on the other axis you’ve got benefit, low benefit, high benefit. And so then I looked at the corners of the quadrant, the extremes, because it’s somewhat intuitive that a low cost, low benefit might get done, but probably won’t.

Mandi Walls: Unless someone’s really excited about it it’s probably not going to get prioritized [inaudible 00:03:37].

J. Paul Reed: Well, and actually those sorts of action items I block, they’re the aesthetic ones. So, they’re like, some engineer is a Pythonista or a Go person, and they’re like, “Somebody wrote some Go code and they’re not a Gopher,” I guess they call those folks. But maybe somebody wrote some Go that is an older version and there’s a new style and it’s just bugging them, right? It’s logically correct, but the aesthetic is bad. So that’s low cost-ish to fix, although I’m sure people have fixed stuff like that and caused an incident too, and it’s also low benefit. So there’s that cluster. Then of course, on the other end of that spectrum, and these posts, by the way, I’m sure you’ll link to them, I drew them on my iPad. I had nice little pictures. But there’s the high cost and low benefit, and those are obviously action items that no one will ever get sanctioned to do. And then as we progress through, what I basically did is again, this idea came from originally just a spectrum, but then you have to go backwards to understand what are the aspects of the spectrum. So, the second post really talks about taking this aspect of benefit out. And the reason I think that’s important is, if you read that first post or you’re talking about it with your colleagues, benefit is doing a lot of heavy lifting in that.

Mandi Walls: It’s doing a lot of lifting for all kinds of things.

J. Paul Reed: And I always laugh, because we have all in a post-incident review had conversations about, “Is this going to be high benefit? Is it low benefit?” And you and I were laughing a little earlier about data-driven, and so well then people try to go get data around benefit, but what does benefit mean? Will it get us more customers? Can we prove that it will cause fewer incidents? That can be weird. I wanted to take out this concept, just remove it. And in the safety sciences, I referred to Sidney Dekker’s quote about “Nobody goes to work to do a bad job.” Although there’s another post I might make at some point where, well, not that people go to work to do a bad job, but I think there’s maybe nuance now about, “We’re all kind of tired and burned out. We may go to work and accidentally not do our best job, because a lot of us are burned out these days.”

Mandi Walls: You can go and do an ambivalent job, but maybe they’re not hostile.

J. Paul Reed: Right? It’s not malicious. And so back to the benefit conversation, that’s a great example of where most engineers I know don’t suggest things that they don’t think are beneficial. Even in that aesthetic example to them, they want a clean code base and they think that that has a benefit. I think personally, I think that has a benefit. And then you can have an argument or a debate with your engineering manager or whether or not things like aesthetics are important in the code base, but the point is that they’re suggesting these action items in good faith, right?

Mandi Walls: Yes. To their lens of things that are going to be beneficial. It’s totally understandable that they’re going to have a narrower view of the things that they’re thinking about than someone else will.

J. Paul Reed: And so I didn’t really kind of go down this road in the post, but I actually think one of the things that is always a good discussion to have when we talk about benefit is to actually explicitly call out what we think the benefit is and who it’s serving. Because a lot of times, and I’m sure you’ve talked a lot about this, in post-incident reviews it’s like we often don’t have customers or support people. We don’t have salespeople, right, or their perspectives. And they may have been impacted by an incident, but they seldom get to come in and talk about what the action items might be. When we’re talking about benefit, connecting that to what constituent group it benefits, that can be a good explicit conversation to have. But then the question or the discussion was, well, if we remove this concept of benefit, and then also we talk about, we kind of tweak that X-axis. Instead of going from low cost on one end to high cost on the other, we start talking about probabilities of whether that action item will get done. That becomes a little bit of a more interesting spectrum. And so the fundamental kind of thesis of the spectrum is, the ends are important to look at and they’re interesting to look at. So let’s look at those, right? We’ve got the low cost items that probably have a high probability of completion, and we’ve all seen those. Those are the ones that are done by the time we get to the post-incident review. We list them up and somebody chimes in, “Already did those first three.” Because they’re so beneficial, they’re so low cost in the context of what we’re talking about. They’re just already done.

Mandi Walls: Well, and especially if the incident was caused by an escaped defect, you’re going to go back and fix that defect anyway because you want to move the feature forward. And often that work gets done before we get around to the incident review. It has to go out anyhow.

J. Paul Reed: Right, well, and actually no, I’m really glad you brought that up because the real version of those particular action items are action items that you actually do during the incident, right? What’s so funny is we don’t tend to think of those as action items, which that’s just an interesting commentary on how engineers think of there’s an incident and then the incident ends, it ends, and then there’s other stuff we do, and it’s like, well, yeah, but those are action items before too.

Mandi Walls: That was all kicked off because you had the incident. It’s all downstream of that incident activity.

J. Paul Reed: Right. Yeah, 100%. You’ve got those, and in the little graph that I drew, it’s like the “it’s already done, right?” thing. Yeah, we already did that. And then on the other side we’ve got the “it’s never ever, ever going to get done.” And I’ll give a couple of examples. I really like one of the stories that I actually described in the post. “Look, it’s never getting done” would be like we have a complete Windows infrastructure and Windows code base, and someone’s like, “Let’s rewrite it in Linux.”

Mandi Walls: Wait, what? Now there’s no problem if it was in Linux.

J. Paul Reed: Yeah, yeah, yeah, right, exactly. Because it’d be like, “Yeah, that’s nice. Thank you for that suggestion, next.” The other example that I actually gave, and again, I love this story. It was an outage that cost hundreds of millions of dollars. It was in the news, I worked on it. It was an incident that took weeks. The actual outage I think was across two or three days, so it was pretty long as incidents go, but then the follow-up took probably a quarter maybe. And long story short, it was a set of settings at a very low level of the infrastructure. It was in the BIOS of various machines. And when they had deployed this data center years and years and years ago, they didn’t have a standard for setting those settings. So this bit them with some higher level software sitting way on top of an operating system on top of a platform and on top of that. But with this low-level BIOS setting, certain blocks on disk got messed up and it was unhappiness all around. And so, one of the action items would be like, go flip that setting on every … Make sure that the setting is the same on every machine. That seems reasonable, right? It’s not too hard to do. We’re talking even if it’s a thousand machines. And what was interesting though is this was already in the context of they knew that there were gremlins in that data center. They were already moving people off of it. They had already incurred this cost. And here’s an interesting thing. They were already kind of doing the action item of moving people out of the data center before the incident happened. That’ll fry your brain about these action items that we’re doing before an incident happens. But in that case too, this incident would give more credence of like, “Yes, we should actually do that action item.” But in this case, what they actually said is, “Well, we could do that. It’s a reasonable amount of work.” And I’m betting, I thought this too, a lot of folks and a lot of management folks are probably thinking, but what if actually standardizing on that setting causes more problems, right?

Mandi Walls: Since you don’t know.

J. Paul Reed: And because it’s been that way for a decade, who knows what’s also relying on it actually being set that way, which for this one problem was wrong, but maybe correct for some other thing or something’s relying on that setting being that way. So they basically doubled down on the migrations of moving people and infrastructure out of that data center, but they didn’t go and fix that item. So that’s an example of it ain’t ever getting done because, and that’s also an example of a nuanced conversation. And that was really the point that I think is relevant to talk about. And as a consultant and working on operations teams that do post-incident reviews, to me, looking at the meta discussion of the middle of that spectrum is where it’s really rich. If you just watch engineers having that discussion, again, you’ve taken out benefit. If you kind of move away from that and just start talking about the realities of what do you actually think the probability is that the action item you’ve described will get completed and will actually get sanctioned to get funded to get completed and all of that. I call that the discretionary space, and I stole that from Rasmussen’s Triangle, and there are links in the post you can read about Rasmussen’s Triangle, but the discretionary space refers to this dynamic where we get to see expertise in action of engineers, expertise in action on the business side, of their own determination of the risk they’re willing to take from a business perspective. We get to see organizational power dynamics. We have all been in an incident review where someone sometimes out of left field comes in and says, “We ain’t doing that.” And everybody’s like, surprised, but it’s the VP of engineering and maybe they have some bit of information like that product’s going to be decommissioned in a month. And that was the point when I sort of thought up the spectrum and was trying to explain the idea of action items. The part that I find most interesting when I’m doing that work, but that I think is actually really valuable for not only engineers but organizations, is that sort of meta observation about the discussions of what is going to get funded and what isn’t. And then once you have a good sense of that, I guarantee you we were talking about meta earlier, Meta’s appetite for certain types of risk is way different than Google’s. It’s way different than Twitter’s. I’m not calling it that letter, Twitter. And it’s way different than a startup versus a huge company. And so when you can get some insight from those discussions, I think it’s a great tool to almost help advocate for items that you may know just from previous experience won’t get funded, but that you think are really important. It helps you actually have a more nuanced conversation. One of the things that I’ve done with teams, it’s one of my favorite exercises to do actually, is we post all the action items, and I talk about this in the post a little bit. So let’s say we have 10 action items. I’ll ask engineers to bucket the 10 into “it’s already done,” “maybe, we’re not sure,” and “it will never get done,” but I have them do that all individually. Then I take the pieces of paper and then I see which items get put in which buckets. And I talk about this in the post. If you then go back a quarter later, or maybe you can go anywhere from a sprint or two later to a quarter later, you then look at who was right about which items didn’t get done at all. And that’s a feedback loop for engineers and managers when they’re looking at future incidents. What was the appetite or the capacity of the organization to even do fix items? Sometimes it’s not about the benefit and the aesthetics, it’s about we’ve got a new product coming out, we actually wanted to do all of those in the discretionary space, but something came along. Or you’ve been there, another incident comes along, and then those items are like, “Okay, well-”

Mandi Walls: Everything gets superseded by this more immediate problem. Everybody’s got recency bias on it, so they forgot they had this whole other list of things they were going to do.
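
The bucketing exercise described above can be tried with very little tooling. What follows is only a rough sketch under stated assumptions, not the method from the posts: the bucket labels, engineer names, and action item names are all hypothetical, and "who was right" is read here as the share of an engineer's confident calls (anything not marked unsure) that matched what actually happened by the follow-up.

```python
from collections import Counter

# Assumed shorthand for the three buckets in the exercise.
BUCKETS = ("already_done", "not_sure", "never")


def score_predictions(predictions, completed):
    """Per engineer, the fraction of confident calls that turned out right.

    predictions: {engineer: {item: bucket}}, collected individually
    completed:   {item: bool}, recorded a sprint to a quarter later
    Engineers who marked everything "not_sure" get None.
    """
    scores = {}
    for engineer, guesses in predictions.items():
        confident = {i: b for i, b in guesses.items() if b != "not_sure"}
        if not confident:
            scores[engineer] = None
            continue
        right = sum(
            1
            for item, bucket in confident.items()
            if (bucket == "already_done") == completed[item]
        )
        scores[engineer] = right / len(confident)
    return scores


def discretionary_items(predictions):
    """Items where engineers disagreed or were unsure (the middle of the spectrum)."""
    votes = {}
    for guesses in predictions.values():
        for item, bucket in guesses.items():
            votes.setdefault(item, Counter())[bucket] += 1
    return sorted(
        item
        for item, counts in votes.items()
        if len(counts) > 1 or "not_sure" in counts
    )


# Hypothetical data for illustration only.
predictions = {
    "ana": {"fix-alert-threshold": "already_done",
            "standardize-bios": "never",
            "add-runbook": "not_sure"},
    "raj": {"fix-alert-threshold": "already_done",
            "standardize-bios": "not_sure",
            "add-runbook": "never"},
}
completed = {"fix-alert-threshold": True,
             "standardize-bios": False,
             "add-runbook": True}

print(score_predictions(predictions, completed))
print(discretionary_items(predictions))
```

The scores matter less than the conversation they prompt: items that land in `discretionary_items` are the ones worth discussing, and a low score is a cue to ask what the engineer did not know about how the organization funds work.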

J. Paul Reed: And so, one of the other things I sort of bring up is Jabe Bloom’s work on Time Span of Discretion, which I think is super interesting from the standpoint of this all goes into the story that we tell about the incident. And I was laughing earlier when we were talking about action items that you do before the incident or action items that you do as part of the remediation during the incident. That all becomes part of the story that we tell. And so the reason that’s relevant here is for those action items as we move over to higher cost and lower probability of completion. One of the things we’ve all seen is that action item was in response to an incident, but now it becomes its own project. And sometimes it becomes its own project because people want to kill it and they know if it’s another project and it gets encompassed, the story is, “Oh, that’s not a remediation item for an incident. It’s a Q4 project.” Then it’s like Q4, that’s where projects go to die. It can also be, that’s the only way it’s going to get funded with time and engineering effort is if it becomes a project. But I think this all goes back to when we’re talking about incidents. There’s this idea that the story of the incident is actually way more valuable to an organization and to organizational learning. Then it goes back to, well, what are the stories that we tell about how we handle incidents? What are the stories we tell about action items and the way we reason about action items? And here comes the interesting part. What are the stories that we tell versus what actually happens?

Mandi Walls: Yeah, absolutely. Because like you mentioned before, the boundary will be before the action items even get kicked off, the real post-incident action items. Their mental picture of that incident stops when the customer experience is restored. And to them, that’s the end of that story, when really you’re probably dragging out what it really means to conclude that incident for several sprints, for several months, or however long these action items take to get put together.

J. Paul Reed: And I love that. I got to make a Sopranos reference. It’s one of my favorite shows, and I’m due to go rewatch it because by the way, now it’s so funny. What was that show? It was late 90s early aughts. It also has the early aughts kitsch to it too, in addition to, it’s a show about a mobster going to a therapist, which all of my Gen Z friends were like, what? That was a five season show? And I was like, yeah, it was really good.

Mandi Walls: That wasn’t just an afterschool special. What are you talking about?

J. Paul Reed: Yeah. Yeah. Back to this idea of stories. One of the things that Jabe showed was, he had a really interesting insight where executives come in and we’ve all had this. They come into an incident and they may try to take over or they may want information or whatever. There are issues with that, but from an empathetic standpoint, you can also understand they want information. But the interesting thing about that that he was talking about is the reason a lot of those conversations kind of go off the rails or they’re hard to have is because executives are generally telling each other stories about the organization that are on the time span of three to five years. So when they come into a situation where the time span actually could be minutes, are we rebuilding the database? Are we doing this? Who’s rebooting the servers? Whatever it is, that’s a super short time span. And so the problem is that because they’re so used to telling organizational stories that are about three year time spans, it’s not that anyone in that conversation is stupid or whatever. It’s that my frame of reference for the time span I normally think about is years, not minutes or sprints after this. So the reason that’s relevant and the tie in with the Sopranos is if you watch The Sopranos, the last scene, did you watch The Sopranos?

Mandi Walls: I didn’t watch it. Yeah. I didn’t have HBO, so I never got around.

J. Paul Reed: Okay, so the last scene is widely considered kind of one of the most controversial scenes ever. Because it’s this whole story about this mobster. A bunch of people killed over the series. A bunch of his friends do all this, and the last scene is him eating in a diner with his family, and it just goes to black. And they kind of do this thing where there are people coming into the diner and you’re like, oh, he’s going to get whacked. But they just sit down and the creator was like, I want this show started with him going to a therapist. I want the viewer to imagine what happens. I’m not going to tell them what happens. So the point is, even though the end of that story can be unsatisfying, it’s kind of like, well-

Mandi Walls: What’s the redemption? Did he get whacked? What happened?

J. Paul Reed: Exactly. Exactly. We do this with incidents, right? Because the longer the incident is open, the story is ongoing because of these action items. The reason that stuff gets pushed to projects is because the organization has to close the story. Executives want that incident story to be done. And so a lot of that discussion about action items is actually informed by an organizational need to be done with that incident. It could be a painful story.

Mandi Walls: Some stuff will just, like you say, it never gets done. It never gets prioritized, it never gets closed. Some of that will depend on, like you say, the strategic view of the company looking out years rather than the things that are going to impact everyone in the short term. Other stuff is going to never get done. It impacts people that are deprioritized, don’t have any power in the organization, are not in the top 10 folks that people think about when they’re making their priorities.

J. Paul Reed: Well, and I talked a little bit about that. If you start to plot this stuff actually out on a spectrum, if you take those little pieces of paper from the engineers and kind of plot out the action items of where they thought it was going to sit on this spectrum, one of the things, and again, I’ve done this too with teams, you can start to see hotspots. If all of the action items related to the database magically get pushed into that kind of high cost and never get done, and they get turned into projects that never get funded and never get completed, you can start to ask questions about, “Well, is that team underwater? Does the team need help? Why are the action items that we think are important not getting completed?” Or it can be a technology problem: that thing’s in COBOL and nobody wants to touch it. And every time we touch it, it crashes a little bit, so we just shy away from it. So the thing is, it also can be a really interesting tool to understand risk before that next incident occurs. Because it’s basically, well, the metaphor being, “I know my car is making this squeaky noise, but if I just kind of turn the music up loud enough, maybe I won’t hear it.”
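
One possible way to look for the hotspots mentioned here is to tally past action items by the area they touched and see where almost nothing ever gets completed. This is a minimal sketch with assumptions called out: the area names and history are made up, and the thresholds (`max_rate`, `min_items`) are arbitrary knobs, not anything recommended in the posts.

```python
from collections import defaultdict


def completion_by_area(items):
    """items: iterable of (area, completed) pairs. Returns {area: (done, total)}."""
    tally = defaultdict(lambda: [0, 0])
    for area, completed in items:
        tally[area][1] += 1
        if completed:
            tally[area][0] += 1
    return {area: (done, total) for area, (done, total) in tally.items()}


def hotspots(items, max_rate=0.25, min_items=3):
    """Areas with enough history whose completion rate is at or below max_rate.

    Both thresholds are assumed defaults chosen for illustration.
    """
    stats = completion_by_area(items)
    return sorted(
        area
        for area, (done, total) in stats.items()
        if total >= min_items and done / total <= max_rate
    )


# Hypothetical history of action items from earlier incident reviews.
history = [
    ("database", False), ("database", False),
    ("database", False), ("database", False),
    ("deploy-pipeline", True), ("deploy-pipeline", True),
    ("deploy-pipeline", False),
    ("alerting", True), ("alerting", True),
]

print(completion_by_area(history))
print(hotspots(history))
```

A flagged area is a prompt for the questions above (is the team underwater, is the technology something nobody wants to touch) and not a verdict on the team.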

Mandi Walls: I don’t have to hear it. Yeah, yeah, yeah.

J. Paul Reed: Right. It’s like, “Well, no, there’s actually a problem with your car.”

Mandi Walls: You need help, man, yeah.

J. Paul Reed: Yes. Yeah, exactly. Yeah. Again, the series of posts was really some thoughts that I had had. And here’s the funny thing. This was really just explicating thoughts I think a lot of us have about our lived experience about action items, but a lot of times when we’re in the trenches doing the work, it’s not clear what’s really going on. And so that’s one of the things I always, with the teams that I’m on, and then when I’m doing consulting, the companies I’m working with, part of that is trying to get folks to slow down to notice the meta conversation about these things. And the goal is, hopefully it’s better outcomes, but at least you have a more substantive conversation.

Mandi Walls: I feel like some folks, they get through the incident, and especially for folks who are very nervous or very stressed out responders, those kinds of folks, and they feel like they’ve lived through the incident and then they want to be just absolutely done. And the post-incident review, that whole process, whatever artifacts come out of that are just completely after the fact is not even part of their thought process. When they’re done, they went to that meeting and things are done and it doesn’t matter anymore. And too few organizations include, like you say, product managers and those folks in those post-incident reviews to be able to say, this thing is obviously a problem. It’s impacting our reliability. We need to work on it. It needs to get into the backlog in a timely manner rather than just engineers trying to then advocate for fixing the things that they saw during the incident.

J. Paul Reed: Well, you said backlog in a timely manner. It needs to get into the backlog.

Mandi Walls: And just to get there, is there a ticket? Is there a ticket?

J. Paul Reed: Right, right, right. Or again, I’ve seen this too. All the action items. This is interesting, isn’t it? The action items get assigned into the inc, the incident space, not the product space. I have seen that anti-pattern too. And so the problem with that is those become disjoint stories. There’s no feedback loop. This is one of the things I’ve been thinking a lot more about, and maybe I have more to say about it at some point, but we’ve talked a lot about stories. I really like that you brought up responders that maybe don’t like being on call. So when they get paged, it’s like, just get done with it. Again, they want to be done with the story. One of the things that I talk about is we can only follow so many stories. There’s a cognitive load to having an open story out there. So I get it, but then the conversation becomes, “Well, do we want short, easily digestible stories, or do we want actually stories that encompass what we actually need to do as an organization, whether that be give support to a team that’s underwater and needs it, or refactor something that we know keeps failing?” That car, get the belt changed, right?

Mandi Walls: Yeah. I think a lot of what folks stumble over too is when we’re talking about incidents and reliability and even the metadata of MTTA, MTTR, those kinds of things, we try and make them feel and sound important to executives by focusing on lost revenue, lost brand value, lost reputation, and all of these kinds of things that are almost not visible to the engineers in the day-to-day. It’s like a Plato’s cave thing going on with that stuff. You see the shadows of them in your day-to-day life, but you don’t necessarily see those impacts unless you’re really following stock price and crazy things like that. That might be a way they bubble up. I feel like we shortchange ourselves on that too, trying to focus everything that way where every incident is going to lose money and everything that we’re doing is going to be a negative impact that way.

J. Paul Reed: So what’s interesting, if you think about this idea of stories, there are archetypes of stories. So like rom-com, boy meets girl, boy meets boy, girl meets girl, whatever. There’s the bildungsroman, the hero’s journey. And so there are archetypes of incidents too. And the one you’re talking about that’s a component of incident stories is they all lost money. What’s interesting to me about that is that’s not always true, and I see that this is a really big debate. Can you measure the cost of incidents? And what’s funny to me is I’ve worked with security teams that have tried to do that, and they even create a dashboard. You can put in a security incident number and, in big type, it’s like, “That cost us 100,000.” And when you start digging into it, what is it? The points are double and all made up. Exactly.

Mandi Walls: And I’m tall. Yeah, it’s all fake.

J. Paul Reed: So you look at that, and this is one of those things, numbers are easy to put on graphs, and then you can look at trends, but the heuristics are always kind of weird. It’s like they’re counting the time of the people that worked on it, and they’re maybe counting missed transactions and average sales. So you can do, “We missed 100,000 transactions, times the average sale,” but it’s all kind of a crapshoot.

Mandi Walls: It’s back of the envelope at best. I just don’t know.

J. Paul Reed: But I always like to tell a story that, this is a Netflix story where they had an incident that made them money. So when people are like, “Oh, it always loses money.” And all of their formulas, it’s all a cost. There’s no-

Mandi Walls: All cost.

J. Paul Reed: And so here’s what happened, and this, by the way, was before my time, but it’s a story that I’ve heard multiple times from multiple people. And I can’t remember what show it launched, but it was one of their Netflix originals, one of their first ones, and it was a big deal, and they still do this. The shows go live at midnight. And some people stay up and watch them and binge them. So that happened. And of course, something in the infrastructure wobbled and it caused a lot of problems. What was interesting though is that it took three or four days, but when they started looking at the stats, what they found is because the incident happened, it made it to the news that this show was so-

Mandi Walls: It’s free publicity.

J. Paul Reed: Right, it was so popular, it made Netflix go down for a little bit. That caused enough buzz to get a notable bump in signups. So that incident made the company money. And so when people are like, “Well, what’s the cost of an incident?” I was like, “Well, what if it’s a benefit? What if it’s a revenue generating incident?” And people always look like, “How could we have a revenue generating incident?” It’s happened.

Mandi Walls: That kind of demand stuff comes up all the time, especially when you’re talking about everybody’s trying to get tickets to Taylor Swift and all that kind of stuff. Just the fact that it’s hard to get tickets to Taylor Swift means more people want to try and get tickets. They make a game out of it or contests with their friends to get tickets to these concerts and stuff. And that kind of demand versus the problems in the supply just drive everything back up.

J. Paul Reed: You said it best, you said it’s all back of the napkin math. And so it’s whatever math you choose to look at. If you incorporated, “Well, what signups did we get or what other revenue generating numbers could we put into that back of the napkin?” Probably the majority of incidents will be a cost. I get that. But if you’re not accounting for that, maybe the cost is more marginal than we think. Who knows? And again, this all goes back to, “This is what I love about this tool.” And talking about this stuff is what are the stories we tell about incidents in our organization. What story gains traction? What incident stories are not a story that leadership and executives want told? And again, no judgment there. It’s just interesting, right? Because it gives you more information about the sociotechnical system you work in, the organization that you work in.

Mandi Walls: The organizations are living breathing creatures on their own, and they’re very much made up of organic parts.

J. Paul Reed: Squishy people.

Mandi Walls: Squishy people, and people are messy, and they will do messy things all the time.

J. Paul Reed: I mean, I will say this, right? I mean that’s why a lot of us do this stuff. The people keep it interesting, right?

Mandi Walls: Yeah. It’s different every day. So if it was just logs and bits, it wouldn’t be quite as interesting.

J. Paul Reed: Yeah, I would have no one to rant and rave to about why the stupid Kubernetes is down again, if it was just Kubernetes. I don’t think Kubernetes cares about my opinion of it.

Mandi Walls: Kubernetes does not think about you at all.

J. Paul Reed: Why not? That’s the AI feature we need, actually. That’s not the AI feature we need, but it’s the AI feature we deserve.

Mandi Walls: Well, it’s the one you want. You want the empathetic Kubernetes. I just don’t think we’re heading in that direction. The “server is sad” kind of comic meme.

J. Paul Reed: If you ever get bored with the podcast, that’s our startup.

Mandi Walls: Yeah, yeah. There we go.

J. Paul Reed: Kubernetes that loves you back.

Mandi Walls: I’m going to call the podcast that. That’s what this episode is called, Kubernetes That Loves You Back. So is there any hope for folks who are frustrated by this process, putting all this effort into post-incident reviews and documenting things, and it just goes off into the ether to die?

J. Paul Reed: Yeah. So one of the things that I would say is I’ve done this exercise with the spectrum with teams, because again, that’s why people kind of bring me in to look at that stuff. But you can do this on your own. And I talk a bit about the engineer that just has to be in an incident. Maybe likes it, maybe doesn’t. I’m a little bit of an adrenaline junkie, so I don’t mind it. I know folks that hate it, right?

Mandi Walls: Oh yeah.

J. Paul Reed: But I think one of the things that I optimized out on purpose was this idea of benefit. But in reality, we all have to advocate for the … And explain the benefit. Again, that’s a story. And we live in a certain economic system that values certain things. And so we’re going to be more successful if we understand how to tell the thing we’re advocating for in terms of things that make sense in that story. So the point is, you can actually take these action items and do this yourself and then start to understand, “Every time I suggest an action item, it ends up not happening.” But then you can start like, “Okay, well why is that? Is it because it’s part of the database? Is it because it’s a team that’s underwater and they just don’t have the resources for it?” Even though I think it’s important as an engineer. And then you can start to have, well, you can start to understand how to advocate for those particular action items that you actually think are important, or understand why certain action items just don’t get done. And again, I think that understanding, I know you and I when we were talking a lot about DevOps, and we always used to talk about you have to understand the business, that’s a core component of it. And this very much is understanding that sort of under the surface, how do business decisions get made within an organization? Again, it’s very different. So, this is a tool and hopefully a toolbox for engineers who, again, great engineers, they’re coding, they’re responding to incidents, doing all that stuff, and they may not have huge insight into what the business is doing. This is a tool to start poking, to understand what’s going on under the surface so that you can be more successful in the organization, again, in which you’re working.

Mandi Walls: Yeah, absolutely. Start to move from tactics to strategy a little bit. And I think that that’d be an interesting thing to do. If you’re an engineer and you have a significant number of incidents in any particular quarter, take all your action items, do yourself a little inventory and see what got done and what didn’t out of all the things that you were involved in. I think that’d be super interesting for a lot of teams.

J. Paul Reed: Yes, 100%. And also track which items got changed into something else.

Mandi Walls: Yes.

J. Paul Reed: The story got changed into, “Oh, that’s not a remediation item. That’s a project now.”

Mandi Walls: Projects. Yep, absolutely. Oh, this is super neat. Okay.

J. Paul Reed: Yeah, yeah.
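
The inventory described above could be sketched in a few lines of Python. This is only an illustration, not a tool from the episode or from the Medium posts: the `ActionItem` fields, the three status labels, and the sample incidents and teams are all hypothetical, chosen to mirror the three outcomes mentioned in the conversation (done, not done, or changed into something else such as a project).

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class ActionItem:
    # All fields are hypothetical; adapt them to whatever your tracker records.
    incident: str
    title: str
    owner_team: str
    status: str  # one of "done", "not_done", "reclassified"


def inventory(items):
    """Count action-item outcomes overall and per owning team."""
    overall = Counter(item.status for item in items)
    by_team = defaultdict(Counter)
    for item in items:
        by_team[item.owner_team][item.status] += 1
    return overall, dict(by_team)


def completion_rate(counts):
    """Fraction of items marked done; 0.0 when there are no items."""
    total = sum(counts.values())
    return counts["done"] / total if total else 0.0


# Made-up sample data for one quarter.
items = [
    ActionItem("INC-1", "Add alert on queue depth", "payments", "done"),
    ActionItem("INC-1", "Rewrite failover runbook", "database", "not_done"),
    ActionItem("INC-2", "Add connection pool limits", "database", "reclassified"),
    ActionItem("INC-3", "Fix retry backoff", "payments", "done"),
]

overall, by_team = inventory(items)
print("overall:", dict(overall), "rate:", completion_rate(overall))
for team, counts in sorted(by_team.items()):
    print(team, dict(counts), "rate:", completion_rate(counts))
```

Grouping by owning team is just one cut. The same counting works for any attribute you suspect matters, such as the component involved or who proposed the item, which is the kind of "why is that?" question raised earlier in the conversation.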

Mandi Walls: Well, sir, thank you so much. Do you have any parting thoughts for our listeners? Encouraging words?

J. Paul Reed: Yeah, just hang in there when you’re coming up with them, right? Again, doing some of the tracking that we talked about, even back of the napkin can give you some insight into what’s really going on.

Mandi Walls: Awesome.

J. Paul Reed: And hopefully that will decrease some of the stress.

Mandi Walls: Yeah, absolutely. Get a little bit more of a handle on it.

J. Paul Reed: Yeah.

Mandi Walls: Cool. Well, we’ll link all the stuff we talked about in the show notes, all of your articles and some of the links and stuff. Where can folks find you? Where are you living in the social medias these days?

J. Paul Reed: So this is so hard these days, right?

Mandi Walls: It is.

J. Paul Reed: I’m more or less jpaulreed everywhere. So again, I will not call it that one letter, but-

Mandi Walls: The artist formerly known as Twitter.

J. Paul Reed: Yeah, the social media site formerly known as Twitter, which is so funny. I see that in the newspaper now. So-and-so said on that letter, formerly Twitter, they have to call it out. Anyway, jpaulreed there. And then are we still doing Bluesky?

Mandi Walls: I’m trying. I like it.

J. Paul Reed: I’m jpaulreed there, but I don’t think I’ve logged in. Maybe I’ll need to make sure I’m following you on Bluesky.

Mandi Walls: Yes, we should put that together. Awesome.

J. Paul Reed: Oh, that one will always work.

Mandi Walls: There you go.

J. Paul Reed: There you go. And that links to everything.

Mandi Walls: Fantastic. Well sir, thank you so much for joining us again. This has been super fun.

J. Paul Reed: Thank you for having me. It’s always great to chat with you.

Mandi Walls: Awesome. And for everyone else out there, we’ll wish you an uneventful day and we’ll be back in a couple of weeks with another episode. Thanks. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at, and you can reach us on Twitter at pageittothelimit using the number two. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Show Notes

Additional Resources


J. Paul Reed

J. Paul Reed speaks internationally on software delivery, critical incident response and management, operational sociotechnical complexity challenges and opportunities, Resilience Engineering, and DevOps. He’s worked with such organizations as VMware, Mozilla, Symantec, and, most recently, Netflix.


Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.