MTTR and Beyond

Posted on Tuesday, Nov 18, 2025
You’ve probably heard of MTTR, and its familiarity alone is a big reason it comes up in discussions about improving incident response. But is MTTR the right metric? Is it enough? There’s plenty to debate. Rich joins Mandi to talk through how PagerDuty engineers are thinking about MTTR.

Transcript

Mandi Walls (00:09): Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me at LNXCHK on Twitter. Alright, welcome back to Page It to the Limit. I’m here again with Rich. It’s been a while since Rich has been on. So Rich, go ahead and reintroduce yourself to the people.

Rich Lafferty (00:40): Yes, hi. It’s good to be back. My name’s Rich Lafferty. I am a Principal Site Reliability Engineer here at PagerDuty. I manage some of our reliability processes, our incident review process. I have a lot of opinions about incident metrics, which I think is what we’re going to dive into today. And yeah, that’s me.

Mandi Walls (01:01): Awesome. Yeah, so today we are going to talk about, or maybe drag through the mud, I’m not sure which: MTTR. Mean time to resolve, mean time to repair, mean time to rock and roll, whatever it is that people are measuring with this somewhat ambiguous metric.

Rich Lafferty (01:25): Yeah, and it’s funny, this all started from a town hall talk. I can’t remember what was actually being talked about at the PagerDuty Town Hall, but at PagerDuty, as in a lot of other companies, all of the action during a town hall happens in the Zoom chat,

Mandi Walls (01:43): In the chat, always

Rich Lafferty (01:44): In the chat. And so I’m sure we’re all paying attention and so forth, but it’s all happening in the Zoom chat. And we got into this conversation about incident metrics, about MTTR. I’m kind of known here, so when people say things like MTTR or root cause, everyone goes, oh, careful, don’t let Rich hear you. Which is also kind of not great, I guess, but it’s fine as long as it’s funny. And then we got talking about it and thought this would be a great topic for the podcast, we should dig into this. And it’s not just MTTR, although MTTR is kind of the poster child for: are you sure you’re measuring your incidents correctly?

Mandi Walls (02:23): Yeah. It really is sort of the top-level one that folks glom onto as the thing they’re trying to improve. And you’re like, okay, but what does it mean? What does MTTR mean to you? What are you actually tracking? And that’s separate from what we give folks in the product, because there is a set of analytics that provides some data and you can do various manipulations to it. But when we’re actually looking at something like MTTR, what do we hope people are looking at? What are we driving for with a metric like that?

Rich Lafferty (03:06): Yeah, and I think the question underneath that too is well, what do you want to measure?

Mandi Walls (03:10): Yeah,

Rich Lafferty (03:11): What do you want to, so there’s incident metrics. If you want to have a couple of boxes, you’ve got a box labeled incident metrics, and it contains all of our usual M metrics. MTTR is the big one. The R can stand for a whole lot of things, but let’s say mean time to restore, the end of impact. Then there’s MTTD, which is time to detect or discover, or some people call it diagnose, but it’s usually detection time, and MTTA, mean time to acknowledge. And so there’s a whole bunch of those. And then there’s of course MTBI, which is not an MTT metric because the third letter is B, but it’s still kind of in that whole category, which is the mean time between incidents.

(03:53): And so there’s a whole bunch of here’s-how-you-might-measure-your-incidents. The challenge, I think, is that these things are really easy to track. Well, generating the data can be easy or hard, but it’s the first thing a lot of people come to. They don’t always question, well, what exactly is it that I’m trying to improve here? What elements of this system am I trying to improve? Am I trying to improve how we respond to incidents? Am I trying to improve and measure the reliability of the system that the incidents are happening on? And that’s kind of where it starts to get a little bit complicated. It all comes back to one of those great questions of DevOps, which is: what exactly is an incident? So I’ve actually heard two really good framings of what an incident is, well, actually let me say three. The first one is an incident is whatever you call an incident.

Mandi Walls (04:47): Okay,

Rich Lafferty (04:49): That’s fine. But still. At PagerDuty, one of our little taglines is don’t hesitate to escalate: if you think something might be up, then trigger the major incident, use the incident process. We acquired Jeli a little while ago, which is now post-incident reviews in PagerDuty. I’m really excited that we acquired Jeli, by the way. Wonderful people, wonderful product. Their version of that, and I’ll probably misquote them, was if you don’t know if it’s an incident, it’s an incident. So the trend under all of that is that what we label an incident is not very consistent. Just keep that in the back of your mind. The other two versions I’ve heard: one, which I used for a long time, is that an incident is an operational surprise, or maybe an acute operational surprise. But I was reading a paper by Lorin Hochstein, whose name I’m probably going to mispronounce, sorry Lorin, and I’ll link to this later, where he talks about it as a system that’s gone out of control.

Mandi Walls (05:58): Oh, interesting. Okay.

Rich Lafferty (06:00): And I read that one and it was like, oh, that’s exactly what it is. Now, you still have the problem from the first one, where an incident is whatever you call an incident. You might trigger your incident process and find out that you were wrong: the system is actually running fine, the system is not out of control, was never out of control. And so then you’re kind of going, well, if we triggered the process, is that an incident? But that might be a little angels-on-the-head-of-a-pin at that point. If you think about it as a system that’s gone out of control that you need to bring back under control, I think that’s probably a pretty good definition to work from.

Mandi Walls (06:38): Yeah, I mean, I think in our documentation, at least on the community side, we say an incident is any unplanned disruption or degradation of a service that actively impacts customers. So we’re injecting that customer side of it kind of intentionally, so that you get a little bit more fire under your butt because there’s customer impact there. But

Rich Lafferty (07:05): Which is interesting because obviously if you’re impacting customers, then that’s probably an incident. If the system goes out of control and you respond quickly and you bring the system back under control before customers

Mandi Walls (07:20): Notice,

Rich Lafferty (07:21): That might still be an incident. I’m not sure. But even if you only count the things with impact, you’ve kind of got this timeline: the system goes out of control in some way, and that leads to some kind of impact. Then we detect either that the system is out of control or that there is impact, then the response begins, then we do a bunch of remediation, the system comes back under control, then the impact ends, and then the response ends. So you’ve got all of these little time markers. I’m gesturing here, no one can see me gesture. So you’ve got all these time markers in a row, and all these M metrics are between particular ones of those times.
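
To make those time markers concrete, here’s a minimal sketch (not from the episode) that computes the pre-averaging durations for one hypothetical incident; the marker names and timestamps are assumptions, and real data would come from your incident tooling.

```python
from datetime import datetime

# Hypothetical time markers for a single incident, in the order Rich describes.
incident = {
    "out_of_control": datetime(2025, 11, 18, 14, 0),   # system goes out of control
    "impact_start":   datetime(2025, 11, 18, 14, 5),   # customers start seeing errors
    "detected":       datetime(2025, 11, 18, 14, 12),  # monitoring or a human notices
    "acknowledged":   datetime(2025, 11, 18, 14, 15),  # a responder picks up the page
    "impact_end":     datetime(2025, 11, 18, 14, 47),  # system back under control
    "response_end":   datetime(2025, 11, 18, 15, 30),  # incident formally closed
}

# Each "M metric" (before any averaging) is just the gap between two markers.
ttd = incident["detected"] - incident["impact_start"]    # time to detect
tta = incident["acknowledged"] - incident["detected"]    # time to acknowledge
ttr = incident["impact_end"] - incident["impact_start"]  # time to restore (end of impact)

print(f"TTD: {ttd}, TTA: {tta}, TTR: {ttr}")
```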

Rich Lafferty (08:06): You might say that the MTTR is, or sorry, the TTR, we’re not averaging anything yet.

Mandi Walls (08:11): No,

Rich Lafferty (08:11): The time to resolve, repair, restore is between the start of impact and the end of impact. Probably.

Rich Lafferty (08:19): Other people might measure it differently, but time to detect is probably from the beginning of impact until detection.

Mandi Walls (08:27): And you may not know that until much later.

Rich Lafferty (08:30): Until much later

Mandi Walls (08:32): When the impact started until you’ve done your investigation.

Rich Lafferty (08:35): Sometimes you have an impact begin and you don’t detect it, and then the impact ends, and then you go back and go, oh. At PagerDuty we call that a silent SEV 1 or a silent SEV 2, for whatever the impact is. But still, even if you’re just looking at MTTR, you’ve still got: the impact starts and the impact ends, and you’ve got a time period in there. And that feels important, kind of naturally important. How long were our customers affected? That’s definitely something you should care about. And I want to make clear, because I do complain about people using MTTR, that when I complain about people using MTTR, it’s not the TTR part.

Mandi Walls (09:11): No,

Rich Lafferty (09:12): It’s the M. The M. It’s the M. And the example I always use, the thing I like to use to kind of throw people off and get ’em thinking about this, is: as you improve the resilience of your system, you’re going to have fewer incidents, and the things that are left are going to be the really weird ones.

Mandi Walls (09:34): Yes.

Rich Lafferty (09:36): What that means is that as you improve the quality of your system, your MTTR is going to increase. It’s going to increase. Yes. Which is kind of the fundamental okay-something’s-wrong-here signal for this metric.

Mandi Walls (09:52): Your N is going down; the number of things you’re averaging across is probably going to go down. It’s just that their impact is going to expand, because they’re potentially so much weirder than the things you started out with.

Rich Lafferty (10:03): Exactly. And I mean, that’s kind of it, right? The denominator of your mean is the number of things you’re measuring. And so that just starts to get a little bit tricky. But the other thing is, even if you’re at the beginning, let’s use that to say, here’s why you need to think about this carefully.
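
A toy illustration of the point Rich and Mandi are making, with invented numbers (not PagerDuty data): engineer away the routine failures and the mean of what’s left goes up, even though the system got better.

```python
from statistics import mean

# Invented incident durations, in minutes.
# Year one: lots of quick, routine rollbacks plus one weird outage.
before = [12, 15, 18, 20, 22, 25, 30, 240]
# Year two: the routine failure modes are gone; only the weird ones remain.
after = [180, 240, 300]

print(f"MTTR before: {mean(before):.0f} min across {len(before)} incidents")
print(f"MTTR after:  {mean(after):.0f} min across {len(after)} incidents")
# The system improved, but the headline MTTR went up sharply,
# because the denominator shrank and only the hard cases are left.
```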

Mandi Walls (10:26): Yes.

Rich Lafferty (10:27): A lot of folks listening are going to be closer to the beginning of their DevOps adventure, and they might still have a lot of small, relatively trivial incidents. I say small, but they don’t have to be small, the impact might be big, customer-facing things where, I don’t know, let’s say the change failure rate is still high.

(10:51): And so they have a lot of things where they release something that wasn’t caught in testing and they roll it back. At the beginning of that adventure, it might take an hour. It’s funny, I say an hour like that’s a long time, but I imagine for some folks it’s a week or a month to detect this and then roll it back, and we want to get better at that. So we still have a high change failure rate, but now we’re catching it in 45 minutes, now in 20 minutes, now in 10 minutes. Our MTTR is improving. So it’s not ridiculous to use it when you’re kind of at the beginning of that,

Mandi Walls (11:34): But how do you know when you’ve outgrown it, when it’s no longer useful?

Rich Lafferty (11:39): Which is a good question. And maybe the thing is, what is there that works for both? Now, I mentioned change failure rate, which of course is one of the DORA metrics, one of the four keys. So there’s another metric right in front of you: can you reduce not just the number of incidents, but the number of times you let a change failure escape, let a bug escape? But what I’m basically getting at when I say the mean’s denominator is going to get you in trouble is that these things aren’t comparable. In the case where it works, they are comparable. You have the same kind of incident, a change failure: how long does it take you to detect it and roll back? You’re really comparing the same things. And then as you get better and better, they become less and less comparable. And I think that’s what I like about the out-of-control part.

Mandi Walls (12:31): Okay, yeah, I see that.

Rich Lafferty (12:33): Right. So there’s a different framing for that, which is what Lorin used in his post, where he went back to W. Edwards Deming, the Toyota Production System, and so forth. And a whole bunch of people listening just went, oh no, it’s Deming, here we go, back to first principles. I promise I’m not going to do that. Anyhow, what he talks about is statistical control. I had never heard this term before. The idea, if you go back to the Toyota system and stuff like that, is that they’re doing manufacturing, and the tolerances of parts or whatever, I’m going to pretend I know about manufacturing, are all within some tiny margin. And then sometimes there are some parts that are outside that margin.

(13:23): And when you have a whole bunch of things that are around the same area with occasional outliers, you can say that that process is under statistical control. And when a process is under statistical control, you can generate statistics on it. It is useful to know that 99% of your widgets are within parameters, and it’s good to know about the 1% that aren’t. But if we rewind it all the way back, I said that an incident is when a system is out of control. Oh, those things are related. Those things are related, it’s the same kind of control. So I think the outgrown-it part is this: when you’re measuring MTTR on something that’s failing the same way over and over, you actually have something that might be under statistical control.
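
A rough sketch of the statistical-control idea, using made-up widget measurements; the point is that averages and percentiles are meaningful here in a way they usually aren’t for incident durations.

```python
# Made-up part measurements: most cluster inside a tolerance band, with rare outliers.
widths_mm = [10.01, 9.98, 10.02, 10.00, 9.99, 10.03, 9.97, 10.00, 10.41, 10.01]
target, tolerance = 10.00, 0.05

in_spec = [w for w in widths_mm if abs(w - target) <= tolerance]
outliers = [w for w in widths_mm if abs(w - target) > tolerance]

# A process that looks like this is under statistical control, so statistics on it
# are useful. Incident durations often aren't like this, which is the trouble with the M.
print(f"{len(in_spec) / len(widths_mm):.0%} within tolerance; outliers: {outliers}")
```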

Mandi Walls (14:12): Yeah.

Rich Lafferty (14:13): The system is not actually going out of control there, you’re just releasing. You’ve got a change failure. And the DORA folks clearly know what they’re doing; all of those DORA metrics are things under statistical control. Out of all of your changes, regardless of whether or not they became incidents, what percentage fail? And the other one, failed deployment recovery time: of your failed deployments, how long does it take you to recover them, regardless of whether or not they had a customer impact, regardless of whether or not they had a severe customer impact, which is also what you sometimes measure. So the deployment recovery time one is interesting too, because it sounds like MTTR to me, but what’s different is that statistical control part. You’re looking at all of your failed deployments. You’re not measuring the impact; you’re looking at here’s when we released it, here’s when we recovered from it. And that’s something you can actually measure.
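
As an illustration of the two DORA stability metrics mentioned here, computed over all deployments rather than only the ones that became incidents; the deployment records below are invented.

```python
from datetime import datetime

# Invented deployment records: every change, whether or not it became an incident.
deployments = [
    {"deployed": datetime(2025, 11, 1, 10, 0), "failed": False, "recovered": None},
    {"deployed": datetime(2025, 11, 2, 11, 0), "failed": True,
     "recovered": datetime(2025, 11, 2, 11, 25)},
    {"deployed": datetime(2025, 11, 3, 9, 30), "failed": False, "recovered": None},
    {"deployed": datetime(2025, 11, 4, 16, 0), "failed": True,
     "recovered": datetime(2025, 11, 4, 16, 40)},
]

failed = [d for d in deployments if d["failed"]]

# Change failure rate: what fraction of all changes failed?
change_failure_rate = len(failed) / len(deployments)

# Failed deployment recovery time: release time to recovery time, impact or not.
recovery_minutes = [(d["recovered"] - d["deployed"]).total_seconds() / 60 for d in failed]
avg_recovery = sum(recovery_minutes) / len(recovery_minutes)

print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Failed deployment recovery time: {avg_recovery:.0f} minutes on average")
```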

Mandi Walls (15:27): Yeah, I mean, it sounds like there’s a case for different categories of failure and different groupings or granularity of how things fail that maybe makes MTTR more or less useful, maybe across different classes of failure versus just kind of pushing everything into the same N and using one big mean.

Rich Lafferty (15:57): Yeah, I think there are definitely ways to make it better. Although one of the things we’re talking about, if things are out of control, not under statistical control, and the problem is the changing denominator and so forth, is that what you’re actually talking about is the distribution. These things are not bell-shaped. I mean, some of them might be, but there’s no reason to think that they would be. They’re

Mandi Walls (16:20): Probably not. When we’re looking at the full set of errors and things that happen across the system, you have a lot of weird little things that you solve very quickly, and we see more of those in the first 5, 10, 15 minutes of the lifecycle of the incident. Then there’s the long tail of long-running things, which is usually a lot smaller than the big agglomeration in the first few minutes of the problem.

Rich Lafferty (16:48): Yeah, I think that’s really common, a long tail. The other distribution you often see is sort of a bathtub, where we have a lot of things that are really fast, and, as we talked about, as you get better the incidents get weirder and weirder and have longer impacts and stuff like that. You can try to push down one end or the other. But what we’re actually saying here is, oh, the shape of the distribution is useful. The shape of the distribution is a property of the incidents we’re having. And when I hear the shape of the distribution is useful, the first thing that comes to mind is we should bucket it.

Mandi Walls (17:20): Sure.

Rich Lafferty (17:23): We’re getting into the space where we actually care about percentiles. We actually care about the shape of it. And at that point, if you want to know the shape of your percentiles, then you’re really talking about histograms.

Mandi Walls (17:32): Absolutely.

Rich Lafferty (17:33): One way that I like to visualize impact, rather than averaging it all out, is to actually bucket it into whatever buckets are useful, and then the way you can tell if you’re improving is by how the shape of the histogram changes.

Mandi Walls (17:52): So the distribution across buckets, you’d bucket by what, time?

Rich Lafferty (17:57): For impact time. Yeah, for impact time. So say you’ve got two different companies or two different teams or whatever, and one of them has a very small number of long-impact incidents, and the other one has a couple of long-impact incidents and a lot of really short ones. You could go, okay, I want to see both the mean and the N, which is fine, and that’s maybe something that your BI system or whatever can handle. But what I think is really useful here is actually drawing out the histogram and showing what that distribution looks like, because humans are really good at noticing those patterns.

(18:40): And if you show the changes in that histogram week to week or month to month or quarter to quarter, whatever period you’re looking at, you can really start to tell, well, how did that shape change? That’ll give you a little bit more information, not just are we having fewer incidents, which is a reasonable question to ask, but what kind of incidents are they? You could also bucket by severity instead of time and so forth. But if what you really care about is how many incidents had this long an impact, then that’s exactly what you look at in the histogram: how many incidents had this long an impact.
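
A minimal sketch of the bucketing idea: instead of one mean, draw a crude histogram of impact durations per period and compare the shapes. The bucket edges and the durations are assumptions.

```python
from collections import Counter

labels = ["<15m", "15m-1h", "1h-4h", ">4h"]
edges = [(0, 15), (15, 60), (60, 240), (240, float("inf"))]

def bucket_of(minutes):
    """Return the label of the bucket a duration falls into."""
    for label, (lo, hi) in zip(labels, edges):
        if lo <= minutes < hi:
            return label

# Invented impact durations (in minutes) for two periods.
last_quarter = [5, 8, 12, 40, 45, 90, 300]
this_quarter = [6, 10, 11, 14, 35, 120]

for name, durations in [("last quarter", last_quarter), ("this quarter", this_quarter)]:
    counts = Counter(bucket_of(m) for m in durations)
    print(name)
    for label in labels:
        # Crude text histogram so the change in shape is easy to eyeball.
        print(f"  {label:>7} | {'#' * counts.get(label, 0)}")
```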

Mandi Walls (19:19): So you’re looking at frequency of that more than anything else.

Rich Lafferty (19:25): To rewind the conversation a little bit though, at the very beginning I said that it’s really important to understand what it is you want to measure.

Mandi Walls (19:32): Yes.

Rich Lafferty (19:33): And so far we’ve really been talking about the reliability of the underlying system. I mean, there’s definitely a piece in there which is how good is the team at diagnosing and recovering from the incident, that’s another thing. But what we’re really saying is we want to understand if the impact to customers is getting better or worse. And this is one of those things where when you say it, it sounds so silly, but one thing you can do to understand if the impact to customers is getting better or worse is actually measure the reliability of the underlying system, instead of measuring the time it takes to work the incidents on it. And of course, what I’m describing are SLOs, service level objectives. I feel like sometimes people go to a lot of effort to build something out of incident metrics that roughly provides the data SLOs could provide, which is sometimes just the metrics you have. But if you really want to understand fundamentally what we are subjecting users to, which is the flip side of are our users happy, then measure the availability. Then, whether or not you declare an incident around a particular availability issue, you’ll still catch it and so forth.
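
A small sketch of measuring the system directly: an availability SLI against an SLO with a simple error-budget check, independent of whether anyone declared an incident. The request counts and the 99.9% target are invented.

```python
# Invented request counts for one SLO window.
total_requests = 1_000_000
failed_requests = 700

sli = 1 - failed_requests / total_requests  # observed availability (the SLI)
slo = 0.999                                 # target availability (the SLO)

error_budget = (1 - slo) * total_requests   # failures you can "afford" this window
budget_remaining = error_budget - failed_requests

print(f"SLI: {sli:.4%} against an SLO of {slo:.1%}")
print(f"Error budget: {error_budget:.0f} failed requests, {budget_remaining:.0f} remaining")
# If budget_remaining trends toward zero, the system is degrading, a leading
# signal you get whether or not anyone ever triggered an incident for it.
```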

Mandi Walls (20:48): Yeah, and we’ve been talking about SLOs for the last couple of years, and as I travel the world asking people questions and talking about some of this stuff, what I’m finding is that unfortunately folks aren’t engaging with SLOs as a concept. I’m not sure if it’s that they’re not seeing exactly what they would get from it, or they’re not thinking about their systems in that way. I’m not sure where the gap is, just that if I ask the room who’s working with SLOs, the hands don’t go up, and I feel like we’re really missing a lot of opportunity to improve our systems by not engaging with that.

Rich Lafferty (21:35): It’s interesting too. I mean, even at PagerDuty, I would say we’re not all the way along the maturity model when it comes to SLOs, and it is work. You do have to figure out how to measure them, and sometimes it’s not obvious how, but a half-good SLO is still better than having none. The thing that really surprises me about people not investing in SLOs is that it’s, well, nothing’s free, but it’s a reliable and relatively low-effort leading metric,

(22:13): And you can detect that your system is degrading before it impacts customers. It just seems like such a big lever to me, but people don’t always get to it. It does require a different way of thinking about the systems, and it does require the work, and of course you have to actually look at them and all of that stuff. But the underlying point is, the reason we measure MTTR is that we want to understand the availability of our systems; alternatively, you could just measure the availability of your systems directly. The other side of that, though, is that there are other things you might want to measure. For instance, when an incident happens, are we good at responding to it? Is our ability to operate our systems improving? That’s also valuable, but because incidents aren’t necessarily comparable, I’m still not sure that MTTR is going to get you that.

Mandi Walls (23:11): Yeah, it feels like we’ve been optimizing for our own pain and not necessarily for the customer’s or the user’s experience. An SLO usually focuses more on what the users are experiencing, while something like MTTR is a reflection of how we feel about the system itself. That’s part of our pain, I guess, being engaged in an incident and working through it: when does our pain end? When can we say this is resolved and we’re done working on it, versus focusing on the customer aspect of it?

Rich Lafferty (23:52): Yeah, I mean, we’re still measuring impact, but I think you’re right that there is something underneath that: incident response and incident resolution is a capability that companies need to have, and the individuals in the company need to have, and we want to tell if that’s improving or not. And what’s funny about that is this is where it actually starts to get hard to measure, because what we’re really talking about is how our people are performing. If you want a really simple incident metric that’s a people-performance thing, you can take mean time to acknowledge: PagerDuty sends a notification, the clock starts, someone hits the acknowledge button or gets onto the Zoom call or into the Slack, however you want to measure that, and the clock ends. You can tell if that gets better or worse and so forth. And those are actually probably pretty comparable between incidents, because you don’t know what’s happening yet.
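
A quick sketch of the people metric Rich describes: the clock starts when the notification goes out and stops at acknowledgement. The page records are made up.

```python
from datetime import datetime
from statistics import mean

# Made-up pages: when the notification was sent vs. when it was acknowledged.
pages = [
    {"notified": datetime(2025, 11, 10, 3, 2), "acked": datetime(2025, 11, 10, 3, 6)},
    {"notified": datetime(2025, 11, 12, 14, 0), "acked": datetime(2025, 11, 12, 14, 3)},
    {"notified": datetime(2025, 11, 15, 22, 30), "acked": datetime(2025, 11, 15, 22, 41)},
]

ack_minutes = [(p["acked"] - p["notified"]).total_seconds() / 60 for p in pages]
print(f"Times to acknowledge: {ack_minutes} min; mean: {mean(ack_minutes):.1f} min")
# Unlike resolution times, these are reasonably comparable across incidents,
# because at acknowledgement time nobody knows yet how weird the incident is.
```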

Mandi Walls (24:50): You don’t know how important it is yet. It’s just, what have you been culturally trained to do? We certainly have customers where certain teams’ acknowledgment time is a lot longer than other places’, but those expectations have to be set up front too.

Rich Lafferty (25:11): Yeah, absolutely. And it’s important. On the other hand, at the same time, I find it’s a little bit shallow. Obviously you’ve got an SLO, or, I mean, it’s not that you don’t have SLOs, you just have really strict SLOs. It turns out if you don’t have an SLO for your availability, then it’s a hundred percent, and so on

Mandi Walls (25:32): All the time,

Rich Lafferty (25:32): But you’ve got some kind of commitment to a customer or to your customers in general that you’re going to resolve issues in a particular amount of time

(25:43): As best effort, as whatever. Then the amount of time it takes to actually get someone on a call and looking at it and stuff like that is definitely an important aspect of that. And if that is really long, if for whatever reason your commitment to your customer, or your desire to provide a service to the customer, and the amount of time it takes people to acknowledge incidents don’t match, then yeah, that’s a problem, and you might want to set some new norms on that. But when I start thinking about shallow, I think about an old blog post by John Allspaw. Of course John Allspaw has to come up, we can’t have this conversation without John Allspaw coming up, back from 2018, called Moving Past Shallow Incident Data. I’ll make sure that one gets in the show notes as well. It’s on the

Mandi Walls (26:29): Kitchen Soap blog somewhere. We’ll find it.

Rich Lafferty (26:32): And it’s a tricky one because John really focuses on deeper human performance data and on qualitative data,

(26:41): And at that point, some of the metrics in there are things like how many people attend incident reviews, how many people read incident reports. That one’s actually been really fun. I rebooted, well, we rebooted, I led the reboot of our incident review process a few years ago, and one of the things we got out of that reboot was analytics on incident reports: how many people have looked at it, how long did they spend on the page, stuff like that. That was really interesting to see. I’ll admit it’s not something that we focused heavily on.

Mandi Walls (27:09): Oh, sure.

Rich Lafferty (27:10): But it was really interesting to see, because you’re kind of measuring how engaging this writeup was to other people, and from there you can go, well, how do we make these more engaging? If you want people to learn from what happened, then their engagement with the process of learning from what happened is very important.

Mandi Walls (27:28): Absolutely. This actually came up as an Open Space at DevOpsDays London this year in September. That was one of the things people posted, how do I get people to read and learn from the incident reports? We spend all this time creating these assets, but do we have to beat people over the head to get them to read it,

Rich Lafferty (27:49): To get them to read it? And I mean, I think some of that is just like, do folks have slack time?

Mandi Walls (27:54): Yeah,

Rich Lafferty (27:55): Small-s slack, not Slack-Slack, but do people have spare time? Is it valued? Is it important? And things like that. I notice we’re getting way off of MTTR at this point. It’s all related. I do notice that sometimes engineers think that there’s a particular kind of formality that goes into an incident report, and maybe that’s from reading public ones, which do have a bit of a formality to them because they’re from the organization that had a problem to the companies that are paying them for their service. But the one thing that I often repeat when it comes to incident reviews is that it’s a storytelling activity.

(28:41): Humans have learned by telling stories for thousands of years. You’ve probably all read an incident review, sorry, a postmortem, where they’ve held onto the reveal, right? They really put you in the seat, in the position of what they were seeing when they didn’t know what was going on. When you’re reading the story, you don’t know what was going on either, even if you’d already heard what was happening, but you really get into the whole DevOps murder mystery of, oh my God, I don’t know what’s going to happen here either. Give incident review authors room to tell the story of being there, especially for internal ones where there’s a little bit more flexibility. If you really come out and say, your target audience for this is your fellow engineers, yes, senior leadership is going to read it, or maybe we’ll write a different, more formal thing for senior leadership, that’s another thing you can do, but your goal here is to tell the story of what you just went through so that people who weren’t there can learn from what it was like to be there. Then that can get the engagement, that can get the deeper incident data. But even at that point, just to pull this back a little bit, you’re starting to go: are we measuring the incident process? Are we measuring our availability? Are we measuring our ability to respond, and so forth?

Mandi Walls (30:16): I’m like, which part of those then becomes our goal? Do we even have goals across any of these potential metrics? Are there things that we need to be focusing on, or should be focusing on, to improve over time? Is there part of this that we’re better at or worse at, or where we’re missing a tool or something like that? Do we have the data to even make those decisions based on some of this stuff?

Rich Lafferty (30:40): And that’s really interesting. You know what? Okay, you’re measuring MTTR, you want to improve it, so what is your goal? That question hasn’t actually occurred to me before, and it’s such a funny one. I don’t know what people would say if you asked them that.

Mandi Walls (30:54): Yeah,

Rich Lafferty (30:55): You want it to be lower. I mean, we all want ARR to be higher, but we still set ARR goals, right? Revenue goals. We want revenue to be higher, but we still set revenue goals, or

(31:06): Sales folks, we want them to close more sales, and we give them targets: close this many leads. Even in some of the silly little things, like if we want engineers to close this many tickets, and hopefully we’re not measuring engineers by the tickets they close, but still, you don’t just say, well, we need more; you say we’d like to get it to this point. And I don’t know how you even do that here. For MTTA you could say, for this service we’re aiming for a response time of five minutes or ten minutes or whatever, right? We see that, definitely. But how do you set a goal for time to resolve any incident?

Mandi Walls (31:45): Well, do you even bother? Is it just something that you’re looking at as a trend over time? For the amount of stuff we see some teams actually own, there are even places where they have so much stuff, so different from each other, that even having a team-level data point is no longer information. It’s just kind of junk science sitting there, because some of their services are so disparate that it doesn’t have any meaning anymore as an overarching metric. Yeah, I think the M is where the bad part starts, and as you add more things into it, it just pollutes the whole data pool.

Rich Lafferty (32:34): Yeah, no, that’s exactly it. And so it really does come back to: even though you have these measurements at your fingertips, you need to be very careful about what exactly it is you’re trying to measure, and what you’re trying to improve. That’s one thing that DORA gets right with their stability metrics, which are different from incident metrics, because, if we go all the way back to the beginning of our chat, what the heck is an incident anyhow? With the stability metrics, they’re saying, on one hand we’ve got change failure rate and failed deployment recovery time, but those are balanced by their throughput metrics, which are change lead time and deployment frequency. The reason they have those two sets of two is that their whole thesis is you can improve both of these at the same time, but if you improve one and the other declines, you might be heading in a direction you don’t want to head in.

Mandi Walls (33:34): Yeah.

Rich Lafferty (33:36): I should also mention that in recent DORA reports they added a fifth one: they say you should also have SLOs. As we pointed out before, why doesn’t DORA say that? Well, they do now. So much of it is just: watch out. Make sure you understand what exactly it is you want to improve and why this number reflects that. Watch out for statistical control: is this something where you’ve got most things within a particular band and you’re trying to deal with outliers, or is the data just so highly variable that you’re not going to be able to draw any conclusions by comparing it to itself?

Mandi Walls (34:15): Yeah, I think that makes it a more subtle problem, a longer-term and more interesting discussion to have, since you’re not just assuming everything is the same and clumping it all in there.

Rich Lafferty (34:28): I mean, the big thing there, I think, is that there’s a trap. MTTR is a huge trap. A lot of these measures of central tendency are a trap because the statistics don’t necessarily support them, but they’re also a trap because you really have to be clear about what you’re trying to measure. Are you trying to measure your incident response performance? Are you trying to measure the performance of the system the incidents are happening on? Are you trying to measure the performance of your people? Are you just trying to make sure that you’re learning and improving from all of this? Is the goal fewer incidents? Is it resolving them faster? It’s probably both. What does that look like in your numbers? Be very careful, when you’re looking at all of these possible things, to understand what the meaning is, what happens to these numbers when the system does improve, and that you’re really looking at the numbers your business cares about and not the ones that are necessarily right there in front of you.

Mandi Walls (35:24): Yeah, yeah. The easiest ones to collect aren’t necessarily the ones that give you the most information.

Rich Lafferty (35:30): That’s exactly it. That’s exactly it. And sometimes the behavior isn’t linear, like we were talking about at the beginning. When you start improving, you’re going to have fewer incidents that are weirder and take longer to resolve, and that’s a sign of improvement. Since you’re not going to have a perfect system, if you end up in that situation, then you’re in a pretty good place.

Mandi Walls (35:52): You don’t want to be seeing the same easy stuff over and over. You want

Rich Lafferty (35:54): Those exactly.

Mandi Walls (35:55): Weird stuff most often. Excellent. Well, I hope this gives folks some things to chew on. I know in the world of the wonky people who really dig into this stuff, there’s a lot of chatter and debate back and forth on different metrics and how to use them and how they apply and all that kind of stuff. For the layperson, MTTR looks like an easy thing to latch onto and get a handle on as a first measure for getting out of drowning in your incident pool. So

Rich Lafferty (36:28): Yeah, I think that’s it. Just going from MTTR to histograms, or making sure that you’re actually looking at your service level indicators and not just your incident metrics, is a good step. Even if that MTTR number changes, that’s going to give you the stuff you need to say, well, why did it change?

Mandi Walls (36:49): Yeah. Yes. Get in there, get those SLOs, find your SLIs. Let’s get in there and get some good data.

Rich Lafferty (36:56): Yes, and also I love the idea of incident wonks, like policy wonks, right? But that is exactly it. I’m going to start labeling myself that now. I am absolutely one, and

Mandi Walls (37:07): You can probably count them all individually because there’s only a subset of those folks around.

Rich Lafferty (37:15): Absolutely.

Mandi Walls (37:15): Yeah. We’re here to share all the info with everybody else, so

(37:20): Excellent. Well, this is great, and if folks out there have other questions, or you have things that you want to talk with us about, you can reach out to us on the PagerDuty community. We’d love to have a conversation about this and how you are tackling your metrics with your team, how you’re discussing these kinds of things, and even how you’re discussing it with your executives. What are you actually bringing to light, and how are you asking for maybe more resources based on the data that you’re seeing? How is this impacting your team over time? It’s all related, so we’d love to hear from you about that as well. So yeah, this is great. Awesome. It all started in a town hall chat conversation.

Rich Lafferty (38:04): All good. I have no idea what the town hall subject was, but it was a good Zoom chat.

Mandi Walls (38:08): Yeah, I don’t remember at all, but then this side conversation started and I’m like, I should record this.

Rich Lafferty (38:15): We need to record this one. Well, I always love coming on. Thank you for having me on.

Mandi Walls (38:19): Excellent. Thank you so much for being on with us again. We love it. For everybody else out there, we wish you an uneventful day and we’ll be back with another episode soon. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at Page It to the Limit, using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.

Show Notes

Additional Resources

Guests

Rich Lafferty

Rich Lafferty (he/him)

Rich Lafferty is a Principal Site Reliability Engineer at PagerDuty, where he builds platforms in the clouds to make PagerDuty reliable and his development teams happy. He calls Dartmouth, Nova Scotia home with his wife and two tortoiseshell cats. In his copious spare time, he can be found enjoying craft beer, staring at a wall, or playing bass in one of PagerDuty’s two office bands.

Hosts

Mandi Walls

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.