Incident Communications With Kat and Mandi

Posted on Tuesday, Jun 20, 2023
How and when you communicate about an incident is important for keeping stakeholders, users, and customers informed. It also has ramifications for your Support teams and how disruptive an incident can be for them. In this episode, PagerDuty DevOps Advocates Kat Gaines and Mandi Walls discuss the intricacies of communicating during an incident.

Transcript

Mandi Walls: Welcome to Page It to The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @lnxchk on Twitter.

Mandi Walls: All right. Welcome back to Page It to The Limit. This episode, it’s just Kat and I, because we’re going to talk about stuff we know a lot about, and that is incident communications. So, we have conversations about it internally as we change our policies about the things that we provide to customers during incidents when we’re dealing with things. So, we’re very prescriptive about what goes where and who gets to tell people what and all that kind of stuff. But a lot of folks aren’t at that point yet with how they communicate about incidents, and that can be a struggle for their teams.

Kat Gaines: Yeah. Sometimes figuring it out can really be just a black hole of information, where you have both problems, I think: there’s a lot of information out there on how everyone does it, and everyone does it differently, and then there’s also no information for your specific needs and use case, right? There’s a lot of tailoring, there’s a lot of testing things out, trying them, failing, saying, “Oh, well that didn’t work,” and going back to square one for folks. And I think that we’ve both seen a fair amount of what that looks like, and so we’re going to get into a little bit of that.

Mandi Walls: Absolutely. Let’s start with internal communications. That’s stressful enough before we even think about what we’re going to tell anyone external to our organization. I know we have a lot of folks out there whose customers might be only internal, as well, right? So, talking to the rest of your organization, how those things go, and where you want to put all that stuff. So, at PagerDuty we’ve got a combination of things. We have folks who subscribe to an actual set of services that represent our big incidents. So, if there’s a major incident, there’s a service that represents that, and you can subscribe to updates about that thing. So, when we have a major incident that goes on, a bunch of engineering management and other folks will get notified that something is going on. Then we also have some hookups into Slack so that there’s a Slack channel called Major Incident Room, and you can hang out in there during an incident and you can subscribe to that, as well, so that folks know internally, and that’s more real-time stuff. It’s constant chatter. We do our scribing in there. There’s a lot of stuff going on, so that can be overkill for some folks, too.
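
For anyone wiring up something similar, here’s a minimal sketch of what opening an incident on a dedicated major-incident service might look like using the PagerDuty Events API v2, so that anyone subscribed to that service gets notified. The routing key, service, and summary below are placeholders for illustration, not PagerDuty’s actual internal setup.

```python
# Minimal sketch: trigger an incident on a dedicated "major incident" service
# via the PagerDuty Events API v2. Anyone subscribed to that service can then
# be notified. The routing key and event details are illustrative placeholders.
import requests

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_MAJOR_INCIDENT_SERVICE_ROUTING_KEY"  # integration key for the service


def open_major_incident(summary: str, source: str) -> str:
    """Trigger an event on the major-incident service and return its dedup key."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": "critical",
        },
    }
    response = requests.post(EVENTS_API_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()["dedup_key"]


if __name__ == "__main__":
    key = open_major_incident(
        summary="Major incident: elevated error rates in notification delivery",
        source="incident-tooling",
    )
    print(f"Triggered major incident, dedup key: {key}")
```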

Kat Gaines: Yeah, I think there can sometimes be a problem with certain internal teams of too much information. So, for example, when I was on the support side of the house here at PagerDuty, we would often wrangle internal communications, as well as external ones, right? And especially when you have new hires or new teams or even new leadership on certain teams who don’t necessarily know where to go, you tend to have a lot of folks dipping into maybe those real-time communications in the Major Incident Room. They don’t necessarily realize, maybe because they just haven’t gotten a chance to get the training yet, that the scribing that’s happening there is purely for scribing. It’s not where a lot of the communication is happening, that there’s a separate call going on that they’re not part of, and they may start asking questions, for example, because they’re anxious and they want to know what’s going on. And that’s okay. Incidents are, by nature, anxiety driven, and you want to be able to support your team, your customers, whoever you’re working with to the best of your ability. But it’s distracting, too, to jump in and say, “I need to know what’s going on because customer X is breathing down my neck.” And so, I think there’s a lot to be said for having really good internal training and just being able to make sure that from the moment people show up at your company, they know that, if there’s an incident happening, they know where to go, they know where to look for updates. Obviously, you can use PagerDuty to send out subscriber updates, stakeholder notifications, those types of things. But it’s really about just having some kind of process in place-

Mandi Walls: Yes.

Kat Gaines: … and making sure it’s not something that’s just buried in a little piece of documentation that maybe somebody might never read, but that it’s something you yell about and people know where it is, right? Because that’s going to prevent a lot of that distraction. I don’t think everyone always realizes how much of a hindrance that can be to actually resolving the issue when you’re off going, “Okay, I need to answer this. Forgive me, but stupid question that came into the incident Slack channel,” instead of going and troubleshooting the actual issue.

Mandi Walls: Yeah, absolutely. We try to nip a bunch of that stuff in the bud as often as possible, but we do occasionally have execs or other folks join a call and get exercised about what’s going on. That’s totally understandable. Like you said, there’s a lot of stress there. Incidents by themselves are, we hope, rare, right? So, not everyone has a lot of muscle memory around exactly what’s going to happen, how all the mechanics work. We have documented our process for folks at response.pagerduty.com. You can read through the things that we practice with incident commanders and, like you mentioned, the scribing, which is really just there for help and documentation and visualization more than anything else. And then, just occasionally, someone comes in and starts asking a bunch of questions and derailing things, and we have to step in. The first question the incident commander is going to ask is, do you want to take over as incident commander? Do you want to run the call? And it’s totally okay if you do, we will step back and deputize. That’s fine. But most of the time it’s just folks being nervous about what’s going on and feeling like they don’t have enough information. So, we try to have a cadence on how often we update: here’s what we found, here’s where we’re still debugging, here’s the teams that are involved, and that kind of thing. And then also, the SMEs that are on the call are often back channeling with the rest of their team to get answers to questions, making sure they’ve got the right expertise, and are headed in the right direction. And other folks might join the call to help out just unofficially there. So, there’s a lot of places where we do that. I have been in calls in my dark past where the CTO or some other muckety-muck joins the call and wants a status and just wants to cheerlead, right? They come in and, “You’re doing a great job. It’s amazing. Yada, yada, yada.” You’re just like, “I don’t need this right now.” It’s horrible. It’s so nerve-wracking because, yeah, this is your boss’s boss’s boss’s boss or whatever. It’s just like, “Why are you even here, man?” In our [inaudible 00:07:01], we refer to that as the executive swoop and poop, which I find absolutely hysterical. It just reminds me of seagulls. If you’ve got food, man, they’re coming after it, and it feels like the same idea. But yeah, it’s so distracting.

Kat Gaines: It’s so distracting. And I think that person is always well-intentioned if they’re coming in, if they’re asking questions. Again, they’re just anxious, they just want to know. They probably have someone else who’s their stakeholder breathing down their neck. And if they think they’re just cheering people on, they’re just like, “Oh yeah, that’s going to motivate them. That’s going to make them feel so good.” No one feels good right now. We don’t want to feel good right now. We want to just deal with this so that we can get to the end of it and go back to sleep or back to normal work, whatever it is. Yeah, it’s the worst. There’s so much to be said for, like you were saying, having cadences and also having SMEs going back to their team. I think there’s a lot to be said, because we talk a lot about external process, for having just as strict of an internal process as you do an external one. So, having just as well-documented and baked-out update cadences, really understanding when and how you communicate to people internally. And then, also, I think, not just putting the onus on one person. We do talk about having an internal liaison-

Mandi Walls: Yeah.

Kat Gaines: … as the internal comms person, and you should have that one person, but also not putting the onus on that person to be the only source of knowledge of what’s going on in the incident for the whole company. Having those SMEs able to scale that information back to their teams as well, so that if there’s someone asking questions, there are multiple places the information lives. Maybe it’s in your team channels, as well as the official update channel. So, if they’re not checking the update channel, we’ve all done that thing where we take a Slack channel and we just mute it, and then we accidentally mute it forever and forget it exists. But if it’s a channel you’re working in every single day, and you see someone who is maybe just cross-posting an update from the updates channel or something like that, you’re less likely to panic and go hunting for information in places where you’re not being helpful.
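
For teams that want to automate that kind of cross-posting, a small helper along these lines can push the same update into the official updates channel and the team channels people actually read, using Slack’s chat.postMessage Web API. The channel names and token handling here are assumptions for the sketch, not a description of PagerDuty’s actual tooling.

```python
# Sketch: cross-post the same incident update to the official updates channel
# and a few team channels via Slack's chat.postMessage Web API.
# Channel names and the token environment variable are illustrative assumptions.
import os

import requests

SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]
UPDATE_CHANNELS = ["#incident-updates", "#team-support", "#team-platform"]


def broadcast_update(message: str) -> None:
    """Post the same message to every channel in UPDATE_CHANNELS."""
    for channel in UPDATE_CHANNELS:
        resp = requests.post(
            "https://slack.com/api/chat.postMessage",
            headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
            json={"channel": channel, "text": message},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()
        if not data.get("ok"):
            raise RuntimeError(f"Slack API error for {channel}: {data.get('error')}")


broadcast_update(
    "Update 14:05 UTC: we have identified the cause of delayed notifications "
    "and are rolling back the change. Next update in 30 minutes."
)
```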

Mandi Walls: Yeah, absolutely. And especially in the weird combination of distributed systems plus monolith that we have here. There’s a lot of things that have tentacles back into the monolith, so there’s an interesting combination of teams that might have to get involved in a particular incident. So, yeah-

Kat Gaines: Yeah.

Mandi Walls: … having as much comms out internally is super helpful during the incident.

Kat Gaines: Yeah, that’s a good point. As your organization grows, too, I remember when I joined PagerDuty in 2014, there were a number of small teams, but pretty much everybody was across two offices. And so, you could just walk over and be like, “What’s going on? Oh my God.” And you can’t do that anymore. Being able to scale processes out as your team and your company grow, too, you have to plan for that, and you have to be able to think about that. And that’s just another point in being able to scale out who is responsible for information, as well. You can definitely have just one person responsible for all of the comms when your company is 60 people total.

Mandi Walls: Yes. Yes.

Kat Gaines: But when you are five, six times that amount, you really need to think about what that looks like in terms of scale and how it grows with your volume, too.

Mandi Walls: Yeah, definitely. And then, layer on top of that external communications, and this is the part I think people love to get super spun up about. What are we going to publish? Where are we going to put it? How often is it going to go there? Do we also put it on all our social media? And all these other additional questions. How do you communicate best with the users that are impacted without upsetting the apple cart for absolutely everyone who might not be impacted? Especially if you’ve got a feature that only affects a certain number of folks, if you put out an all-points bulletin, everybody’s like, “Should I be seeing this? What’s wrong? I don’t see it. Is that a problem?” and freaking out, as well. So, there’s a balance to find in being as specific as possible while you’re still trying to figure out what’s wrong, without freaking everyone else out, too.

Kat Gaines: Yeah. There’s something that I wrote down a couple of weeks ago about assumptions in incident communication, and it can go both directions. It can go in the direction of assuming that your audience is sitting in your stack all day every day, and they know exactly what’s going on.

Mandi Walls: Yeah.

Kat Gaines: And every silly internal term, everything that you refer to internally is going to scale out to them. So, you might assume that they understand all of the naming of your internal tools, all the naming conventions, that language that is really spoken only by people who work at your company, tools that deliver parts of your services, code names you have for launches that are coming up-

Mandi Walls: Right.

Kat Gaines: … engineering team names or, even better, freaking acronyms. All of those things, because you work at the company, are so baked into your language and the core of what you’re thinking about day to day, that it’s really tempting to write messaging that uses those things. And then, when anyone outside of your company reads it, they’re like, “What? How did we get here?”

Mandi Walls: Yeah, yeah.

Kat Gaines: And then, there’s the other side of it, too, which is assuming that your audience is stupid and they don’t know anything, and then just not telling them enough.

Mandi Walls: Yeah.

Kat Gaines: And I feel like I see a lot of both of those camps, and it’s really hard to finesse the happy medium in between the two, where you assume that your audience are intelligent human beings who use your product, and therefore have some understanding of it. But they’re not in your backend day-to-day, and they’re not in your team rituals and the silly acronyms and names that you make up to make your work interesting and fun day-to-day, either.

Mandi Walls: Absolutely. The product naming part of that is already hard enough. And then, if you’re doubling up on names internally of the things you’re also calling things when you ship them to your customer, that’s going to cause a lot of confusion. We have the problem just naturally. We’ve got integrations, but the word integrations means a lot of different things.

Kat Gaines: It does.

Mandi Walls: But for the users, they’re thinking mostly about how a third party integrates with a service, and that’s a very specific workflow. However, we’ve got a bunch of other things that work the same way under the hood, but the customers don’t necessarily have that viewpoint into it. So, when we are talking about integrations, then we have to disambiguate and figure out exactly what part of integrations we’re talking about and which team actually knows that thing. And some of that stuff can get really confusing for people. So, our external comms have to be really focused on what parts the customers actually know, what they can see, and how it’s going to impact them. And not, “Project Athena has taken a stack dump, and we can’t fix it right now.”

Kat Gaines: Yeah, exactly. It’s about the symptoms, right?

Mandi Walls: Yeah.

Kat Gaines: It’s about the symptoms that a customer is going to experience. They don’t really care about how it got there, or, if I’m going with the analogy that I started here, which maybe I shouldn’t, what the specific disease is or what medications are going to treat it. They care about the symptoms and how do you make it stop?

Mandi Walls: Yeah.

Kat Gaines: And that’s what you need to focus your comms on. What are they going to experience in the product itself? When they log in, what is going to look weird or broken, or what is simply not going to work? Are they even going to be able to log in? Those types of things, right? And not getting too caught up in anything beyond that, because it doesn’t matter to them.

Mandi Walls: Yeah.

Kat Gaines: They just want to know what can I do and not do? And when will I be able to do it again?

Mandi Walls: Yeah. And getting to the point in an incident where you actually know and can pinpoint exactly what is going on and how that’s going to be reflected in the customer experience is super important, as well. We can get a lot of situations where things are wobbly, but it’s some deep backend process and we’re not really sure exactly how it’s going to manifest for users. It takes a little bit of investigation to be able to say, “Okay, this is going to impact users with this capability in this data region, and here’s what it’s going to look like,” so that we’re not freaking everyone out and causing support more headaches than they need.

Kat Gaines: Oh, causing support headaches. That never happens in incident response, right? That’s a myth. It never happens. Now, I think that’s the thing, too. So, your support team or some customer-facing team are often going to be your customer liaison. And I gave a talk at a conference a couple of months ago that was on this topic of support communication during incident response. And the thing about that is that your support team are going to be the people who are best equipped to act as translators between your engineering team and your customers. They have the understanding of your internal systems, all that stuff we were talking about earlier, the code names, the acronyms, et cetera. They know what those things mean, and they also know how to actually speak to customers like the human beings that they are, in the human terminology that’s going to get them there, and being able to connect with what they experience and what they understand in the product. And so, I think there’s something to be said, too, for whoever your customer liaison is. If it’s support, if it’s a different role in your organization, fine. But take cues from them when you’re defining your incident response processes: what information do they need? How can you get information to them quickly enough, making sure they know where to go for it, making sure that the process for communication is well outlined and defined? I’ve even seen folks have system maps that connect the dots internally so that customer-facing teams can connect the infrastructure part of the product to what customers see on the other side, which can be really helpful as well. And then, letting them lead when drafting comms. You know, we were talking about the executive swoop and poop earlier. You don’t want anyone else to try and come in and say, “Oh, here’s what we should be saying, here’s what we should be doing.” Because is it your job to sit there and talk to customers all day every day? No. If it’s not, step back, relax a little bit, and let the person whose job it is at least come up with the first draft, and the incident commander should be reviewing that for accuracy. The SMEs should have input on, “Here’s a detail that we think should get passed through.” But that customer liaison should be really driving those comms, whether they’re doing it fresh, starting from scratch, or they’re doing something we recommend at PagerDuty, which is to have templates. We actually wrote our response guide around templatizing responses, making sure they know where to start there, and that they feel empowered to lead that communication, too, even if it’s a template. A template is never a finished thing.

Mandi Walls: Yeah.

Kat Gaines: I think that we get a little bit stuck in, “Oh, well there’s canned communication, so we can just throw it out.” No, it’s not a finished product. It’s a template for a reason. That’s why we use that word. But making sure they know where to start, they can really finesse the messaging based off of that template. And then, they can be the ones to put that final product out there in the world and say, “Okay, we have communication for you that, again, you can digest, you can understand, and hopefully it’s not going to scare you too much-”

Mandi Walls: Yeah.

Kat Gaines: “… when you see there’s some things going on with our product.”
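
To make the template idea concrete, here’s a rough sketch of what a symptom-focused, customer-facing status update template could look like. The fields and wording are illustrative only, not PagerDuty’s actual templates; the point is that nothing in it leans on internal code names, team acronyms, or backend detail, and the customer liaison still finishes and reviews the message before it goes out.

```python
# Sketch of a symptom-focused status update template. Fields and wording are
# illustrative; the customer liaison fills it in and the incident commander
# reviews it for accuracy before it is published.
from string import Template

STATUS_UPDATE = Template(
    "We are investigating an issue affecting $affected_feature. "
    "Customers may notice $customer_symptom. "
    "$workaround "
    "Next update by $next_update_time."
)

update = STATUS_UPDATE.substitute(
    affected_feature="email notifications",
    customer_symptom="delays of up to 15 minutes in receiving notification emails",
    workaround="SMS and push notifications are not affected and can be used in the meantime.",
    next_update_time="14:30 UTC",
)
print(update)
```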

Mandi Walls: Yeah, absolutely. And especially for incidents that aren’t long-running, there’s probably no reason to get your corp comm people involved. There is probably a point or some catastrophic error that comes through where you’re going to want to have your PR people, your corp comm people-

Kat Gaines: Yeah.

Mandi Walls: … also muster to do more communications to an executive level or to the board or to the press or whatever. That is completely separate from any call that is going on on the technical side to fix things, right?

Kat Gaines: Yeah.

Mandi Walls: Those folks should have their own-

Kat Gaines: Call. Yeah.

Mandi Walls: Yeah. They’ve got their own thing over there, and they can be doing that asynchronously, while the SMEs and everyone else still run the incident.

Kat Gaines: Yeah. But, I mean, if I’m even going to put, I don’t know if it’s a hot take, but a take in there on PR communications for those long-running incidents: someone from your customer-facing teams should still be involved there.

Mandi Walls: Absolutely. Yeah. Support leader or somebody. Definitely.

Kat Gaines: Yeah. It’s probably not the same person, though. Let’s say you have a support engineer acting as customer liaison in the incident itself. Let them focus on that. That’s one job. If it’s a long-running incident, hopefully you’ve also cycled through customer [inaudible 00:19:36]-

Mandi Walls: Yeah, and rotated.

Kat Gaines: … people a little bit of a break. I had a friend recently tell me they’re on a small team, and they brought PagerDuty to the support team that they’re part of, and they had a nine-hour incident at their company, and they were the only customer liaison on the call the entire time. And I almost cried when I saw that because it just sounded so horrible. And-

Mandi Walls: Oh, yeah.

Kat Gaines: … they’re basically rolling out PagerDuty at their company for support, and they’re hoping to avoid that situation in the future, but it just sounded so rough. So, yeah, don’t give your customer liaison anything else to do, especially if they’re hanging out for nine hours on one incident.

Mandi Walls: Right?

Kat Gaines: But have somebody from your customer-facing organization, maybe your support leaders, maybe someone on your success team, again, someone who’s familiar with your customers and talks to them on a daily basis. Have someone in that conversation with the PR and comms teams, too. Because if you leave them out of that, and this is not to say anything against your PR and comms teams, they’re probably wonderful and doing a great job, but they just don’t know your customers as well. It’s just a fact. And so, you need someone who has those relationships, who is going to understand what’s going to resonate with people, to balance out the messaging with the higher-level initiative, as well.

Mandi Walls: Yep, absolutely. I have definitely been on incidents. I was on one years ago where I had a wireless headset, and it was one of those fancy telephone ones that hooked into the actual phone in the office. But I was on the outage call so long, the battery died. And that isn’t fun either, because then I had to be on speakerphone and be that jerk in the cubicle farm with it on speakerphone.

Kat Gaines: Yeah. Yeah.

Mandi Walls: Nobody wants that. Work from home. It’s great.

Kat Gaines: Work from home, it’s wonderful. Work from home and have someone to hand off to, for god’s sake.

Mandi Walls: That’s right. Absolutely. So, once the incident is over, there’s a whole postmortem process and everything that we go through. But part of that, too, is there’s a bunch of post-incident wrap-up communications. Our mechanics are maybe a little bit different from what other folks do. One of the first things we do is the incident commander sends an email to a group email account in our email system for folks who have subscribed to know about incident reports. And it’s just a summary of what happened, what the customer impact was, when the incident ran, how long it took, and what our final action was. Did we roll something back? Did we roll something forward? Did we turn something off? Whatever we did. And then it points them to what will, eventually, be the postmortem report, which is an actual document in the wiki that they can then follow along with as it gets added to. So, other folks are then able to join in with additional information. If they had to drop off the call during the incident, they can come back to the incident review and add any information that they had in there. So, we end up, then, with a document written up about the whole incident, and there’s plenty of ways to do postmortems, and folks have lots of different processes, and there’s lots of different templates and workflows for doing those. It’s just what we’re doing right now. And then, part of that will go to the public, which is also super fun.
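
As a rough illustration of that wrap-up email, a summary like the sketch below covers the pieces described here: what happened, the customer impact, when it ran, the final action taken, and a pointer to the postmortem document that will be filled in later. All of the names, values, and URLs are placeholders.

```python
# Sketch of a post-incident summary email: what happened, customer impact,
# timing, the final action taken, and a link to the postmortem document.
# Every value below is a placeholder for illustration.


def incident_summary_email(
    summary: str,
    customer_impact: str,
    started: str,
    resolved: str,
    final_action: str,
    postmortem_url: str,
) -> str:
    """Assemble the plain-text wrap-up email as a single string."""
    lines = [
        f"Subject: Incident summary: {summary}",
        "",
        f"What happened: {summary}",
        f"Customer impact: {customer_impact}",
        f"Started: {started}",
        f"Resolved: {resolved}",
        f"Final action taken: {final_action}",
        "",
        "The full postmortem will be written up here as the review progresses:",
        postmortem_url,
    ]
    return "\n".join(lines)


print(
    incident_summary_email(
        summary="Delayed outbound notifications",
        customer_impact="Notification delivery delayed by up to 15 minutes for a subset of accounts",
        started="2023-06-20 13:10 UTC",
        resolved="2023-06-20 14:42 UTC",
        final_action="Rolled back the notification pipeline deploy",
        postmortem_url="https://wiki.example.com/postmortems/2023-06-20-notifications",
    )
)
```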

Kat Gaines: Yeah, that’s always the fun part.

Mandi Walls: Right? That’s always the fun part. It’s like, “Oh, look at what bug crap crawled into this nonsense.” But not every organization has that practice of being super public about what they found during an incident, what they had to do to fix it, and if there are any lasting ramifications for users. I figure that if they don’t put that stuff out, somebody is still getting those queries. And again, for your support people, they’re probably on the receiving end of a whole lot of vitriol on that.

Kat Gaines: Yeah. And I think that’s maybe one of the most crucial pieces of advice for anyone listening who is in one of those organizations where you don’t wrap up with an incident review of some kind and publish it externally. Start, because your support team or your other customer-facing teams are probably suffering so hard as a result. They’re having to come up with language on the fly. They’re having to spend extra time basically holding a separate incident review just as a team to understand, “Okay, what do we do with messaging?” But then, they probably still have to go coordinate with the incident commander. And so, it’s just holding two incident reviews. You’re just making it harder on yourself, basically.

Mandi Walls: Yeah.

Kat Gaines: It’s just an extra meeting that you’re going to have to have anyway. And if it’s not happening right now, it will. So, don’t say, “Oh, I’m lucky. We’re idyllic. We’d never end up with that.” Your customers are asking. Someone’s just not communicating the pain.

Mandi Walls: You just don’t know it is coming in, yeah.

Kat Gaines: Yeah, yeah, yeah. And I think, too, the other piece of incident reviews is making sure that talking about the communication itself is part of the incident review, because things do go wrong with communication. Maybe someone didn’t use a template when they should have, maybe we had that executive swoop and poop, and it was hella distracting. Maybe somebody had a lot of big opinions about how we should say things and an argument just broke out on the call. All kinds of things can happen. Maybe we said something really stupid that we don’t want to say next time publicly. There’s no shame in that. It happens. We’re all human. But you do have to talk about it in the review itself when everything’s wrapped up, and you do have to make sure that you understand where it’s going next time.

Mandi Walls: Yeah.

Kat Gaines: You can have all types of beautiful things come out of that. You could develop your templates out of mistakes. You can say, “You know what? Whatever we’re doing isn’t working. Maybe either the people who we have writing customer comms aren’t the right people, or potentially we’re focusing on the wrong things when writing them.” That’s your time to refine, not just your processes for incident remediation itself, but for how you talk about it with the other human beings involved, whether it’s internally or externally.

Mandi Walls: Yeah, absolutely. Well, is there anything else that the people should know about?

Kat Gaines: I think we pretty much covered it. I think that what we really want people to take away here is that it’s an iterative process. It’s never done. You’re always going to learn from your incidents. Again, they’re hopefully rare, but embrace them when the time comes. I’m grimacing a little saying that. Our audience can’t see me, but there’s a little bit of a grimace because I just feel like it sounds hokey to say, “Embrace this opportunity to learn things.”

Mandi Walls: Right? It’s a gift, yeah.

Kat Gaines: If we’re in an incident call, we’re all going, “Oh God, this sucks.” Just the whole time. You’re not looking for the silver lining when you’re in the call if we’re being real. But we should. We should be able to look back at it and say, “Okay, here’s this cool thing that we learned from this incident that, yeah, we didn’t want this to happen in the first place, but now we know about this vulnerability that we have. Or now we know that this process we’re running just isn’t efficient. Or now we know that we have a really easy way for maybe a newbie to accidentally break something with their first deploy.”

Mandi Walls: Yeah.

Kat Gaines: “Maybe we want to fix that or write some documentation around it.” It has never happened to anyone, right?

Mandi Walls: No, no, never.

Kat Gaines: Or maybe our comms practices are not refined. So, it’s really that it’s a learning opportunity to be able to say, “Okay, here’s how we’re talking about things. Here’s how we’re approaching them, here’s how we’re talking about them, and here’s what we want to do better next time.” And I think the big takeaway, too, is that if you think your communications practices are perfect, look again because you have an opportunity somewhere. There’s something that needs a little bit of help and assistance. And if you’re ever in doubt, go back to your customer facing teams. They probably have opinions even if they haven’t expressed them yet.

Mandi Walls: Yep, absolutely. Well, I hope folks find this interesting. I’ll put a bunch of links in the show notes because we do have lots of documentation about our process and some of our recommendations and things that we hope will get folks started if you haven’t been thinking about this. Or you’ve seen other folks do it and you’re like, “We should totally be doing that. How do we get rolling on it? What do we have to do?” Well, Kat, thanks very much.

Kat Gaines: Yeah, thank you.

Mandi Walls: We’ll let folks go. We’ll wish everybody an uneventful day, and we’ll talk to you again in two weeks.

Kat Gaines: Bye, you all.

Kat Gaines: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making the podcast possible. Remember to subscribe in your favorite pod catcher if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter @PageIt2TheLimit using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.

Show Notes

Additional Resources

Hosts

Kat Gaines

Kat Gaines (she/her/hers)

Kat is a developer advocate at PagerDuty. She enjoys talking and thinking about incident response, customer support, and automating the creation of a delightful end-user and employee experience. She previously ran Global Customer Support at PagerDuty, and as a result it’s hard to get her to stop talking about the potential career paths for tech support professionals. In her spare time, Kat is a mediocre plant parent and a slightly less mediocre pet parent to two rabbits, Lupin and Ginny.

Mandi Walls

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.