Mandi Walls: Welcome to Page It to The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability, and the lives of the people supporting their system. I’m your host Mandi Walls. Find me at LNXCHK on Twitter.
Mandi Walls: All right. Welcome to Page It to The Limit. Today we’re going to talk about incident communications. Sharing updates with your team, your management, and your customers is an important part of managing an incident. I’m Mandi Walls, and we’re joined today by Alina Anderson from Smartsheet. Alina is a Senior Technical Program Manager for Observability Engineering at Smartsheet. Welcome to the show, Alina.
Alina Anderson: Hi Mandi. I’m super excited to be here.
Mandi Walls: Oh, fun. So we’re recording live in our booth at PagerDuty Summit. We’re going to talk a bit about Alina’s session at Summit, and some other things. So to get us started, how would you describe incident communications, for anyone sort of new to it, or doesn’t really know exactly what it means?
Alina Anderson: So the way I think about incident communications is essentially getting the right context to the right people at the right time. So often we’re hyper-focused on solving the technical problem, but the team responsible for telling customers the issue’s resolved didn’t get the memo. And so they’re just kind of waiting, and ultimately the customer suffers.
Mandi Walls: Yeah, definitely. How did you get started? Like how did this get onto your radar?
Alina Anderson: Our organization at Smartsheet, we have scaled rapidly. We’re a hyper-growth company. Our technical response did not keep pace with that organizational scale. And last year we had a major outage, and this really broke open the gaps across the company, in where we didn’t have process in place. We didn’t have the communication infrastructure in place, for the entire organization to react to a major incident like that.
Mandi Walls: And as you’re thinking about those kinds of incidents and the way those things happen, at what point did you start thinking about the actual communications part? Like, was that part from the very beginning? Is it maybe the reason to look at it in the first place?
Alina Anderson: Any response to any incident is communication to your customers. You are making a choice on how you’re communicating to your customers, regardless of what processes you have in place, or don’t have in place.
Alina Anderson: So the lack of an external update to customers about service status, because maybe you don’t have a team where that’s their job, is communication. But unfortunately, what it’s communicating to your customers is that their pain isn’t important to you, that they can’t trust you, that you are not earning their loyalty.
Alina Anderson: And oftentimes too, within your company, you have field teams. So your sales or CSMs, and they’re frustrated, and they don’t feel like engineering cares about their pain, because they’re having to hop on a call with a customer who’s angry and frustrated. So the lack of an ability to update the right people at the right time is communicating oftentimes what you don’t want.
Alina Anderson: So it’s really about pivoting that, to create intentional and proactive communication channels that are communicating the message that you want. Which for all of us, I think I’ve heard a lot of different themes at Summit this week around trust and loyalty. And you know, now we’re in this 24/7 world where everyone’s depending on their SaaS tools. And so when something’s not working, everyone wants to know what’s going on.
Mandi Walls: Yeah, absolutely. I mean, you’re in a point now where saying nothing says a lot about [crosstalk 00:03:52].
Alina Anderson: Yeah. Yeah.
Mandi Walls: … really thinking about the people that depend on you.
Alina Anderson: Right. Yeah. And so I guess for me, it’s not, “Oh, now I’m going to create this communication process.” It’s no, you’re already communicating, but it’s figuring out a way to more effectively communicate and land the message that you want.
Mandi Walls: It sounds like you went through the whole process sort of in a combined way with your new incident response, and all of these kinds of things. Like how has sort of improving your communications, has it helped you improve your overall response process, and all the other things that you’re sort of thinking about with that?
Alina Anderson: Yeah, in my experience, you know, Smartsheet has really great culture, and I love the people that I work with. And so often you have a group of really well-intentioned folks, they’re working their tails off on their task. And we overlook the bigger picture, which is how information travels throughout the organization. How information has to go from one team to another, for the other team to do their job.
Alina Anderson: And so that was really where we really dug in and tried to figure out what roles and responsibilities make sense, and how can we divide things up in a way so that we’re removing dependencies, and we’re making those boundaries more clear. So that the technical team can focus on, Hey, I just need to make sure that I get these bits of information out. And then support and other folks can run off and do their end of it. Kind of like a relay race, I guess, is an analogy that I would think of.
Mandi Walls: Yeah. Those handoffs are pretty important. You want that to go well. So what other teams did you involve in this? Is there a role to play for folks outside of your engineering organization [crosstalk 00:05:34].
Alina Anderson: Yeah, absolutely. So we have a communications response team that gets spun up for major incidents. And so we have representation from support, folks for social media, they’re monitoring customer signal.
Alina Anderson: So they’re ensuring that we have a really close eye on, are we getting the signal from support cases or social media or field teams, that kind of thing. And they’re crafting any messaging that needs to go out, or any framing. And so the technical team is really able to just focus on addressing whatever the service issue is, and getting it back up as soon as possible.
Mandi Walls: Yeah. That’s awesome. That’s like a whole other sort of set of experience, or set of expertise that you’re bringing to bear on the whole process.
Alina Anderson: Yeah. And I think, in working on the presentation, there’s one slide that actually really captures that. And it was really that moment of seeing what we see day to day, in a picture and like, wow, there’s actually a handful of work streams that splinter off. You know, there’s a technical team, but with the information that they pass on to the business teams, they all are running off on their separate work streams, across all the different touch points we have with our customers.
Alina Anderson: Ultimately what we want is the customer to have a consistent experience across the organization, on a particular issue that they’re concerned about. So they’re getting the same communication and message, whether they’re reaching out to support, or their CSM or their account manager. And those roles within the organization also feel equipped to have that conversation with the customer.
Mandi Walls: Mm-hmm (affirmative). Do you send them through special training? Are they internally certified for that? Or …
Alina Anderson: That’s actually a really good idea. Sounds like a good next year roadmap project. You know, I think right now we just have a lot of people that really care about the customer experience, and are really engaged. And we are just figuring it out as we go, and continuing to make iterative improvements.
Mandi Walls: Yeah. As part of that whole learning process, what’s been surprising? Has there been anything that you really weren’t expecting to find while you were doing this work?
Alina Anderson: What has been surprising to me I think, is that a strong incident response culture actually requires personal development. Like teams, a culture of psychological safety and continuous learning. And it requires everyone to attempt to keep a cool head, to kind of manage under really high stress conditions, and to remain kind.
Alina Anderson: And when something’s broken, it’s very easy to … you know, for someone to snap or point fingers, or that kind of thing. And my observation, like I didn’t really quite make that connection until now, looking back, and setting the people first foundation, actually allows your processes to iterate much more effectively. Because it gives you a North star to really focus on, and is the process serving the people? Is the process serving this culture that we need? You know, and not just about having a great tool, it’s the people part I think that has been the most surprising to me, in terms of how that can make or break your process.
Mandi Walls: Yeah. If we can dig into it a little bit, was that something else that you needed to really work on with your folks? Did you have that culture naturally already, or is there other exercises you’re doing?
Alina Anderson: Yeah, we, Smartsheet already had a … you know, we’ve won Best Place to Work a bunch of times in Washington. And we already had this starting foundation. The challenge has been maintaining that as we grow. Going from a hundred engineers to now, almost 400 across the globe.
Alina Anderson: And so how do you maintain that culture? And how do you keep that blameless PIR practice going? And as you grow, it requires a little bit more formalization in training, and you can’t quite rely on that small group of people that are just all on the same page. You have to translate it through the organization a little bit more intentionally.
Mandi Walls: Can you share with us how you’ve done that? Are you doing seminars or internal training?
Alina Anderson: Yeah, so we have a weekly operations review. And this has actually been a huge opportunity of learning for our organization, because we pick a few post-incident reviews every week, and the engineering managers and ICs and leaders all get together, and we just talk through like, okay, what did we see here? What learnings from this can be translated out to all the other teams?
Alina Anderson: And so it’s a culture of learning around, okay, what did we learn here? And if there is … everyone often knows, sometimes there was a human error that causes an issue. And the dialogue is not, Oh my gosh, how could that person possibly have made a mistake? But it’s where in the system are we allowing human error to impact our customers, and how do we remove it? Keeping the focus there really enables I think people to feel free to be honest, and be transparent around where the gaps are, and not try to hide or cover anything.
Mandi Walls: We talk a lot about this with the blameless postmortems work that we do. So in the workshops and the ops guide around that stuff. Did you find in your organization that people were receptive to it? Did they [crosstalk 00:11:03].
Alina Anderson: Yeah, absolutely.
Mandi Walls: Yeah?
Alina Anderson: And I think, as if you’re a scaling company and growing company, and you become more of a melting pot, with folks from all different kinds of companies, which some people come to Smartsheet because where they were before was definitely not blameless. And they didn’t like it, right? And so I think it’s on leaders within the engineering organization to demonstrate the behavior, and to ensure that we are making it safe for someone to say, Hey, you know what? Yeah, this is what happened. Then we can collectively say, okay, how do we automate this in a way that’s not requiring the one single person who got the one single email notification about the thing to do an action?
Mandi Walls: Awesome. Sweet. Well, okay. So one segment that we always have on our episodes in the podcast are debunking a myth. So talking about incident communications, and expanding it a little bit more too, like we can pick into blameless postmortems, or some of the other things we’ve talked about. Is there any myth, or common misconception, or anything about some of the things we’ve talked about that you come across from, and you feel like you’re always telling people the same thing?
Alina Anderson: Yeah. I think, especially as I get deeper into the specific area, incident communications is not marketing’s job. It’s not, “Well marketing, that’s their problem,” or, “They’re going to figure it out,” or … Like the speed and quality of information that comes out of the technical team is critical, is absolutely critical. You know, on the relay race idea, you are the first one out, and the faster and more accurate you are at enabling all those other teams within your organization, the faster the customer … The customer is going to be maybe even delighted that like, “Wow, you even told me about this before I noticed it. And you really have my back.” Especially for any customers that have IT admins as a key customer.
Alina Anderson: IT admin has to answer to everyone in their organization. And it’s not fun for all of your teams to be pestering you, like, “What’s going on with this tool that we have?” And you’re trying to ping the vendor, and the vendor isn’t communicating. So we put those IT admins in a really uncomfortable place, if we’re not effectively doing this.
Alina Anderson: And so I think the myth that this communication is like, Oh, it’s education, it’s not technical. You know, actually instrument so that an engineer can automatically push required information out to other teams that can go do their thing.
Mandi Walls: Have you found your customers are appreciative of the changes you’ve made? Have they reported back that they approve of all this stuff?
Alina Anderson: Oh, absolutely. Our CSM teams and support teams are very excited. And they’re able to self-service, and instead of just pinging a guy who knows a guy about the issue, they’re just able to go and check and say, “Hey, what’s going on here?”
Mandi Walls: Awesome. That’s fantastic. All right. So now we have a couple of other things we like to ask folks. Two sort of side by side questions. So the first one is, what’s one thing you wished you would have known sooner or earlier in your career, when it comes to this whole practice of running software in production, and all the fancy business that goes along with that stuff?
Alina Anderson: I wish I would’ve known sooner that these are really tough problems that no one has actually solved or has the answers to. You know, for a long time I felt like I was behind the curve or Oh, I just … we must … this must be a self problem. And then now, especially getting involved with the PagerDuty community, and a lot of great vendors; we work with Datadog and [inaudible 00:14:53] and AWS. And it’s like, wow, so many organizations are grappling with these same challenges.
Alina Anderson: And for me, I think the current state of the world and the new remote workforce, the next generation of innovation is going to come in this communication space. This is where, in my opinion, the leading edge is, because we’re going to have to get communication out to our users faster than ever before. Companies are depending on our products to run their businesses. They can’t afford downtime. And if there’s something impacted, they need to know immediately. And so I think I’m excited to see sort of what that next gen of tools and products looks like.
Mandi Walls: Yeah. Have you seen a lot of changes in the way your team communicates? Has it helped them out as everybody’s been affected by work from home, and all these other changes? Were you guys mostly remote before, or-
Alina Anderson: It helps surface where we just have verbal communication, and we’re not actually documenting or having something run through a tool. Because it’s tougher to just walk by somebody … You know, you can’t walk by somebody’s desk and get the update on the incident, right? Like, there actually has to be that practice and that discipline around documenting the exact live status, what’s going on, and all those types of things.
Mandi Walls: There’s no red light attached to the wall at the end of the cubicle now that everybody can look at, right? [inaudible 00:16:28].
Alina Anderson: Yeah, right. Right, right.
Mandi Walls: Or not. All right. Last question then, is there anything about all this stuff that you’re glad we didn’t ask you?
Alina Anderson: My backlog.
Mandi Walls: Oh.
Alina Anderson: There’s so many exciting things. And especially with PagerDuty announcing some new features, like the support product is really interesting to me. I mentioned this in the Q&A in the talk yesterday, where in this space, when you’re working on communications for an organization, it often feels like two steps forward, one step back. You can definitely feel like, Oh, we’re not making progress. But if you step back and you look at it, you can often, Oh yeah, we actually are moving forward. But the nature of the beast is just that there’s always a huge backlog. There’s always huge improvements. There’s always kind of a new set of people that want to be informed.
Alina Anderson: You may have a new leader that comes in and says, “Hey, I need an SMS to my phone when this happens.” So you’re just sort of on that continuous improvement, and you just have to be, Okay, what is that incremental improvement that maybe would get something out two minutes faster, or five minutes faster? And sometimes that’s huge.
Mandi Walls: Yeah. Is there anything that for organizations that want to get started in this, is there anything you’d recommend they start looking at, or where they should start?
Alina Anderson: Yeah, my recommendation is take your next PIR, Post Incident Review, and as part of that timeline, layer in your communication or lack of communication in there.
Alina Anderson: So when you say, okay, detection was at this time by whatever tool. Okay, when did we actually declare an incident? Okay, when did we page in this team? Okay, when did we notify support? Or, Hey, support wandered into our war room and asked, “What’s going on?” Capture those things in there, capture the first customer communication that went out.
Alina Anderson: So once you’re able to sort of start looking at that, and then you can say, “Whoa, it took us 36 minutes to get an update to our strategic customers.” And then that gives you a baseline. Because you may actually be like, “Wow, okay. We did great in 10 minutes, and we’re all pretty comfortable at 10 minutes. And now we just need to formalize what we’re already doing.” Versus, “Wow, we’re not comfortable with 30 minutes whatsoever. And what are we comfortable with? Do we want to try seeing if we can do this five minutes faster next time?”
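The timeline exercise described here can be sketched in a few lines of Python. This is only an illustrative sketch, not Smartsheet's tooling: the event names and timestamps are hypothetical, chosen so that the gap to the first customer update matches the 36-minute example mentioned above.

```python
from datetime import datetime

# Hypothetical PIR timeline; event names and times are illustrative only.
timeline = [
    ("detected", "2020-01-01 14:00"),
    ("incident_declared", "2020-01-01 14:06"),
    ("team_paged", "2020-01-01 14:09"),
    ("support_notified", "2020-01-01 14:21"),
    ("first_customer_update", "2020-01-01 14:36"),
]

def minutes_since_detection(events):
    """Return {event name: whole minutes elapsed since the 'detected' event}."""
    parsed = {name: datetime.strptime(ts, "%Y-%m-%d %H:%M") for name, ts in events}
    start = parsed["detected"]
    return {
        name: int((moment - start).total_seconds() // 60)
        for name, moment in parsed.items()
    }

gaps = minutes_since_detection(timeline)
for name, minutes in gaps.items():
    print(f"{name}: +{minutes} min")
```

Laying the communication events next to the technical ones gives the baseline: the team can then decide whether that number is comfortable or whether to aim for a few minutes faster next time.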
Mandi Walls: Yeah. Excellent. You got to start with the data as it is-
Alina Anderson: Yeah.
Mandi Walls: … and then figure out your [crosstalk 00:19:13].
Alina Anderson: Because sometimes perception and reality are very different.
Mandi Walls: Absolutely. People are screaming on Twitter. It may have only been a couple of minutes before that they’ve noticed it, rather than any real delay.
Alina Anderson: Yeah, yeah. Right. And sometimes it’s setting expectations. If there’s a certain audience that needs a notification instantaneous, that’s going to be a different tooling notification, than if it’s 10 minutes or 15 minutes.
Mandi Walls: Yeah, absolutely. Well, this has been great stuff. We’re at the end of our time. For folks listening to our podcast here, Alina’s session Incident Communications Made Easy from PagerDuty Summit is going to be available on demand. Check out the summit.pagerduty.com website for that. And eventually those things will move to YouTube, depending on when you’re listening to this.
Mandi Walls: But thanks for joining us. Thank you Alina, for coming along today.
Alina Anderson: Thanks, Mandi. It’s super fun. I could talk about this stuff all day. I love it.
Mandi Walls: Love it. Absolutely amazing. So thanks everybody for tuning in. This is Mandi Walls wishing you an uneventful day.
Mandi Walls: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast, if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at page it to the limit using the number two. That’s Page It to the Limit. Let us know what you think of the show. Thank you so much for joining us. And remember, uneventful days are beautiful days.
How does your team handle Incident Communications? Having a plan for how you are going to keep your customers and stakeholders informed, and who is going to do it, is important when incidents happen. Listen this week as Alina Anderson, Senior Technical Program Manager for Observability at Smartsheet joins Mandi to talk about Smartsheet’s journey to finding a communications plan that works for their teams.
Alina is a Technical PM at Smartsheet, software trusted by over 75% of Fortune 500 companies to operate mission critical business processes. She is an expert at cat herding through complex challenges involving humans and systems. On the weekend, you will find Alina volunteering on a crisis hotline, wandering through a farmer’s market or hosting family dance parties.
Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.