The Incident Response Landscape with CTO Tim Armandpour

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me at LNXCHK on Twitter. All right, welcome back everybody. This week I have with me PagerDuty’s own CTO, Tim Armandpour. Tim, welcome to the show. Tell the audience a bit about yourself. You’ve been at PagerDuty a very long time.

Tim Armandpour: Yeah, I’m serving as the CTO currently here at PagerDuty. Been with PagerDuty for the last eight-ish years, eight and some. It’s been a wild ride and one of those really unique opportunities that keeps making me feel like we’re just getting started in so many ways. It’s been a love fest since I’ve been here and I’m lucky enough PagerDuty’s kept me around for the ride.

Mandi Walls: That’s awesome. And we wanted to get your perspective. We wanted to share with folks what we’ve seen in the industry. Our team has been asked to participate in some new conferences about incident response, about reliability, and things from that perspective, and we wanted to get your thoughts. Have you seen this for the last several years? What are your thoughts on the state of incident response and trends that you’re seeing and why are folks suddenly so interested in this thing that PagerDuty’s been doing forever?

Tim Armandpour: This is one of the many things I’ve put in the bucket of why the founders were brilliant to start this company. They’re absolutely light years ahead of parts of the industry in order to elevate the incident response experience like a first-class status. And it’s a testament to the growth and success of our company. I think the success our customers have had when they leverage us, and not just us in terms of the product and software, but also us in terms of parts of our know-how, whether it’s through advocacy or our community and others that are now practicing what we call incident response more and more often.

Running trends fall into a few categories. One is we’re definitely seeing appetite for more automation and part of that is fueled by things that have been written a decade plus ago around how important it is to have consistency and predictability. Many folks, whether you’re an incident responder on-call, you’re an executive trying to make a big decision about how to invest in this area, you have consistency, you have predictability, and we’re all striving for aspects of proficiency. And that takes time and practice and activation of incident response muscles into the mix. I think that’s both cultural, that is skills to be acquired, and experience to be had.

Automation comes up a lot, whether that’s orchestrating the people, which has been a lot of PagerDuty’s bread and butter over the years, but also bringing in the advent of curated and understood and repeatable and durable playbooks and runbooks. And again, right in the wheelhouse of our platform offering. Applying machine learning and recognizing patterns and grouping things into some cohesive, actionable elements is also top of mind because the complexity factor surpasses the human factor and the human’s ability to keep up all these things.

It’s really interesting. All these things are starting to come together more and more and more, which is why I think there are now more conferences dedicated to this practice, dedicated to this topic. I like to say we were definitely ahead of the game, in a lot of ways, in terms of where we thought the puck was going to be, and in some parts it’s showing up. Yet complexity has also taken on a life of its own. There’s threat vectors and threat models showing up everywhere. When you talk about incident response as a practice, there’s then the security incident response as not exactly a branch from the incident response practice, but it is sometimes a little bit different. There’s similarities, but it is sometimes different and with threats on the rise, I think that’s now come into the forefront. While the responsibility for, let’s say, hardening a security response practice and capability within an organization sits with a [inaudible 00:04:25], the accountability to be part of that sits now across a lot of different groups because of the advent of a variety of operating environments.

You’ve got cloud, you have my own data center, my own colo facility is still… It’s not like those are gone and off the planet. Then you’ve got things in the middle with hybrid environments, where I’m not going to move all my workloads to the public cloud. Maybe I’ll run a little private cloud and I’m still going to run and operate in my own data center. It depends. The advent of the complexity, microservices architecture, SaaS, everything, all this I think is now starting to come to a healthy tipping point that allows people to really, really get into the conversation flow, knowing that this stuff takes investment. It doesn’t come for free, but it does pay back dividends when you need it the most. There’s no doubt about it.

Mandi Walls: Yeah, definitely. What I’ve seen, just even what we invest in getting incident commanders online and we have those folks that are not all in engineering. I’m not sure you’d ever really want me as your incident commander, but I’m on call for it, so good luck. Fingers crossed. And I wonder too, some of the things you’re talking about, there’s organizational complexity as well as the environment complexity and wondering if that’s also helping us along in that we see fewer and fewer teams where all the operational capabilities and the reliability are outsourced to another team that’s no longer just engineers. And now, the engineers are responsible for the, “You code it, you run it, man, this is your stuff now.” And maybe getting a little bit more footing on fixing things and being more responsible for reliability, even though things are more complex than they were.

Tim Armandpour: No, totally. And even just to that point of, here at Pager, we abide by the mantra, if you build it, you code it, you own it type of scenario, and it’s just full stop in that regard. Put that on a slide and say it over and over again. It doesn’t matter. You have to be willing to be able to put the practice into it, put the tooling, put the processes and the practices in place, build up the experience, and that’s where I think there’s still so much to be done across the industry when we talk about, “Go simulate the failure modes.” Well, where are you going to do it, in a small environment? “No, go do it in production.”

Mandi Walls: What?

Tim Armandpour: You got to think about failure modes upfront and your architecture and your design, your choice points along the way and be able to get people comfortable being in the hot seat when you do need to fix a thing. While machines are alive and well, it’s still very much a people sport at the end of the day. That’s not going away anytime soon.

Mandi Walls: Definitely, and like you’re saying, there are definitely folks who are a little less, I don’t want to say good at it, but not as skilled at it or not as practiced at it, and they fall apart when things aren’t going well.

Tim Armandpour: Being good at it grows with, I want to say, time in the seat of doing it and feeling like you’re well-equipped to fulfill that part of the job. We can’t ask our people just to go do it. It’s more like, “How are we going to help surround you and support you with the right process, practice, experience, tools, insights, et cetera, to do that part of the job well?” And that’s where, again, I get super excited about PagerDuty just getting started in so many different industries and so many different operating environments and customer types and platforms and products out there, but there’s still a lot of ground to cover.

Mandi Walls: Yeah, definitely.

Tim Armandpour: And it ain’t easy.

Mandi Walls: Yeah, right. No problem. No problem. Job security. It’s all good.

Tim Armandpour: No big deal.

Mandi Walls: It’ll be fine. It’ll be fine.

Tim Armandpour: Just real quick, even on that front, I look at ourselves in the mirror a lot and look at myself in the mirror where here’s PagerDuty, one of the biggest value props we put out there for the industry and our customers is we’ll always be there when you need us. We are the dial tone. We’ll always be there. For us, we need to maintain that availability and the resiliency profile for our customer base while we’re also developing on top of a bunch of things we don’t control, like the cloud, telcos, and whatnot. Getting in that mode of getting really comfortable to architect and be in the art of masking failures and also being able to build up your muscle around how am I going to react to a thing, whether that’s an automated manner or “Oh no, there’s seven people that have to now collaborate and go fix a thing.” How do you do that where you minimize the impact to your customers where basically, nobody really notices but you. There’s a lot to be done to figure that out, as well.

Mandi Walls: There’s so many weird, moving pieces. So many of the strange responses I’ve been on about weird things that the telcos do and there’s one little bug in one library somewhere that takes everybody down for something else. And it’s just like, “Oh my goodness, this is a Rube Goldberg machine. How does the internet even work?”

Tim Armandpour: Exactly what it is. It’s like an endless loop of fun.

Mandi Walls: Of course. It is. It’s so interesting to pull back all those layers and see what all is in there and how things are working, and then somebody uses it in an interesting way and you’re like, “What are you even doing? But that seems really cool.” Coming up for us, we’ve got a lot of artificial intelligence and machine learning capabilities that are coming online. Are we getting to a point where we don’t have to get to humans anymore?

Tim Armandpour: Nah, I think the humans… We’re going to be in the loop for good reason. Here, the emergence of generative AI, it’s huge, it’s profound, it’s impactful. We can’t all qualify or quantify it yet, it’s such early days. But it’s exciting because of how accessible it is. That’s the true acceleration factor that’s shown up in the industry. All of a sudden, it felt like overnight, any one of us can now play with this thing. Whatever it might be, whether it’s through APIs or a chat interface. The race is on by all these service providers to become one of several that will be these newfound operating systems for all of us. I think there are some types of things only really, really big companies can do, whether that’s self-driving cars or commercialized space travel or literally opening up the sheer robust capabilities of generative AI to the masses in as safe and responsible a manner as possible. With that said, given that it is early days, I think even with that technology, what we all find ourselves doing when we do play around with it is that you’re still in the business of validating and verifying the response you got. That’s not probably going away anytime soon. That human-in-the-loop part of it is a big deal, but there’s a lot of lift and leverage you get out of adopting parts of the automation effect that comes in with the advent of, let’s say, generative AI. Can you get a fast start to certain things? Can I get 80% of the way there really reliably, really predictably, really well, and then I’m now spending time on the higher judgment-oriented engagement mode for the last 20%? That’s where we’re going to be for quite a while.

Mandi Walls: Which still seems amazing, since a year and a half ago it seemed like we wouldn’t even have that.

Tim Armandpour: We didn’t even talk about it.

Mandi Walls: Yeah, wasn’t even on the horizon.

Tim Armandpour: Our Chief Product Officer, Sean Scott, says this often, and I think he nails it in terms of the analogy, but self-driving cars… A lot of companies, a lot of efforts got to 80%. That last 20% is really, really hard. I think we’re in a similar path with generative AI, it is one of the fastest adopted, fastest growing emerging technologies we’ve all experienced in a long, long time.

Mandi Walls: It’s exciting to see what folks will do with it and how it’ll help them smooth out their incident responses and that whole process and take some of the parts that feel onerous about working through an incident, whether it is, “How do I start this automation script” or “I don’t really want to sit down and write this postmortem, but it needs to be done.”

Tim Armandpour: Exactly. When we start to more readily leverage, let’s say, generative AI as an asset for this, what we can all strive for is to start to have more and more consistency in approach. I think that’s an important aspect of why a company like ours is investing in it. One is we still think there’s just not enough organizations and teams and people out there doing all the things that make up the wonderful world of incident management as a practice. Then how that starts to apply to other parts of a company at large, leveraging a platform offering, our operations cloud for example. And if you can create a little more consistency in approach, one, it’s less learning for an individual to do. You’ve got this ally or this angel on your shoulder and it’s always going to be there for you, letting you get to that 80% completion. Now all of a sudden, what does that acceleration mean for you in your daily life, for your team, for your organization, for your company at large? That’s where, when we’re able to really harness the power of what this enables, it’s a game changer. A really classic example is with post-incident review. This is one of those areas that takes people, people are not absolved of it. Why? Because there are some high-order, complex, judgment-like scenarios that come up in post-incident review, both in the analysis, in the discussion and the conclusion and the follow-ups. We believe generative AI can definitely help all organizations get to a more predictable practice with respect to post-incident reviews such that you can close the feedback loop more effectively with your teams. And who’s going to benefit from that? Your teams themselves, absolutely. And then whomever you’re providing your service for. Why? Because there’ll be a high quality-of-service offering enabled over and over and over again.
And can we take, I don’t know, hundreds of hours a year out of the workforce, enabled by that 80% fast start, and then have you focus on almost fill-in-the-blanks? Those are arguably the most complex parts of the scenario you’re dealing with. That’s where generative AI can get really interesting, really fast.

Mandi Walls: Yeah, I think so, too. The process of writing a post-incident review is you’re collecting up all this textual data out of your Slack channel or whatever you’re talking about. You might be listening back through to your call recording. There’s all the mechanics there that not only a human is probably exhausted by, that’s just a lot to try and take in and distill down into the salient points and not miss anything, which then you give it to an AI model and it can pick those things out much more efficiently. And that takes that whole onerous part of that task out and gives you something more beautiful than what you probably would’ve gotten from most humans anyway.

Tim Armandpour: Absolutely. And then again, where we sit, we’ve got 14 years of all kinds of interesting data. Keep driving some of those responses, if you will. And then there’s publicly available data around the dependencies you have. How are they performing in that moment? How do we know from their status page? There’s still a wild world to start to combine these… These variable data sets come together and then really get the machines cranking, and that’s only going to get better with time.

Mandi Walls: I hope so. There’s definitely places where I see weird little holes. You still get telemetry from weird systems that are only… The output of their logs or their metrics are only for people who really understand that system really well. I’m like, “Can we put some semantics on this so that everybody can get some data and information out of these pieces?” So that when we do shoot it into an ML model, it’s like, “Oh yeah, this is all database.” And we can all just glom it back together.

Tim Armandpour: But Mandi, I keep getting told Kubernetes is going to solve all that.

Mandi Walls: Kubernetes, yes. Kubernetes. I should have a gong sound.

Tim Armandpour: You can mention it. It just can’t always be mentioned as a solution.

Mandi Walls: No.

Tim Armandpour: [inaudible 00:16:45]

Mandi Walls: Right? Absolutely. Everything was going to be solved by the cloud. And that wasn’t complex enough, so now we’re going to solve it with Kubernetes.

Tim Armandpour: Magic of the cloud is exactly that, it’s magical. It has magical powers.

Mandi Walls: No idea what’s going on under there. Absolutely no clue. Are there things that you’re seeing across our customers or the other folks that you talk to that folks should be looking out for that are causing folks to struggle or to have issues with getting to better practice or implementing better practices in their teams, for IR?

Tim Armandpour: Yes. As leaders, for example, we’re all dealing with some of the same challenge areas, which is even more constraints imposed, but nothing’s getting any simpler. With that in mind, whether it’s a headcount constraint, whether it’s a spend constraint with the next great tool or whatever it might be to trial and explore and figure things out, my systems, my business, my organization is not getting any simpler. One of the unfortunate realities we all typically face, when you’ve got a list of priorities and you’re looking at what things can I, at least for now, pause or put on hold. Quite often, a lot of our go-to is going to be things in the quality bucket because, “I still need to get this new thing out of the gate and so I can cut some corners on quality.” Now, when it comes to incident response as a practice and as a cultural ethos, it takes investment. And so when there are companies and people out there… We talk about resiliency management, resilient architecture, reliability management, so on and so forth. The time is now. It’s fantastic. It’s all needed. And the key question is how can you afford to cut in those areas when ultimately, if you don’t know that the heart is beating, that’s a problem. How do you know how to restart the heart, at the end of the day? It does take practice and investment, and many are challenged with understanding that cost and investment in value quotient because until you’ve gone through a really dark moment with your service offering because a change got deployed or because of an unknown dependency factor or because of something, you name it, this is not your immediate go-to. Those that have really unlocked that and have that as a steady-state part of their business, part of their way of operating and doing things, are most well-prepared to literally grind through how fast things are changing for us. We still invest in what we had coined many years ago, before my time, Failure Fridays.
It started as a specific two-hour window on a Friday. This is by design, to do it on a Friday, because everybody’s afraid of doing changes on a Friday. Two-hour window on a Friday, where a controlled-failure scenario is actually getting worked as a major incident response in the live production environment. And here’s PagerDuty doing that where we have to be the dial tone, we’re always supposed to be up and running, et cetera. We were rarely perfect at it, but we learned a ton through that. And then we started to build in a practice around that where things like, “You’re going to be a tier one service in the architecture and you’re ready for prime time. Have you gone through your Failure Friday scenarios? Oh, not? Then, no, you can’t get to a hundred percent traffic.” And then how does that start to translate over the years? We’ve grown, and we’ve grown in different ways, both from a maturity perspective and a complexity factor, and it’s really fun to see where a lot of teams are now graduating to this mode of failure any day. It’s part of the heartbeat, but you’re practicing the muscle where you are specifically injecting failure, in a controlled manner, into the environment in some manner in order to learn about your thing. And you’re not just learning about the software and the systems and the dependencies and the interdependence of parts of the architecture, you’re also learning about your people. How well-equipped are we to handle that? Where’s the knowledge? Is it documented? That thing you did, is it part of a runbook? That process we just ran, is it predictable? Is it durable? Every single time that something like this shows up, and so it’s just this constant, continuous improvement mode like [inaudible 00:21:21] over and over and over again, but it takes investment. We believe it’s a worthwhile investment that pays dividends because we’re able to provide a much higher quality of service for our customers. 
All the teams aren’t always on fire because of some production noise, so to speak. It ebbs and flows, but we’re not drowning, and we’re also, I’d like to think, pretty proactive about doing that because we feel it’s important for the benefit of our customers. When I dream a dream, all organizations adopting these things are getting better and teaching the community back so that ultimately everybody benefits. Don’t wait until the house is on fire to go find the water. That’s not the way of doing it, in my opinion. That is a challenge area, in that the first area to cut is the quality bucket. Let’s not do that as best we can.
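
The tier-one gate described in this answer ("Have you gone through your Failure Friday scenarios? ... Then, no, you can’t get to a hundred percent traffic") could be sketched roughly as follows. This is a minimal illustration only, not PagerDuty’s actual tooling: the `Service` record, the scenario names, and the 10% fallback ceiling are all hypothetical assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical record of a service and its failure-injection history."""
    name: str
    tier: int  # assumption: 1 means a tier-one service
    required_scenarios: set[str] = field(default_factory=set)
    completed_scenarios: set[str] = field(default_factory=set)


def max_traffic_percent(service: Service, limited_percent: int = 10) -> int:
    """Return the traffic ceiling for a service.

    Tier-one services stay at `limited_percent` (an assumed value) until every
    required controlled-failure scenario has been exercised; other tiers are
    not gated in this sketch.
    """
    if service.tier != 1:
        return 100
    missing = service.required_scenarios - service.completed_scenarios
    return 100 if not missing else limited_percent
```

A gate like this only records whether the scenarios were run; the learning about people, runbooks, and documentation that Tim describes still has to happen in the exercise itself.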

Mandi Walls: We see that a lot where that stuff kind of gets dropped until the next big failure and then, “Oh, it has to be resource-

Tim Armandpour: Reacting to the next big failure.

Mandi Walls: Reacting, yeah.

Tim Armandpour: And it’s funny. When I first got here, we were starting to put together the practice around incident response training and practicing, incident command as a true persona and a practice to take on. And we modeled a lot of things out of the movie Apollo 13, with Ed Harris, mission control. We got some really great Wiki pages around this place still sitting around. But if you think back, with him as incident commander and the rest of the supporting crew and mission control, were they not well-prepared to react, without overreacting or underreacting, but react in order to solve for a common thing, we may not have had those folks survive the Apollo 13 mission and come back home safely. Egregious example, but there’s so many parallels, in terms of how that showed up in the movie, to where we’ve always thought that that’s how we need to operate. That’s how we want to get to, but it doesn’t happen with the flip of a switch. You got to make it a thing. That for me is still very top-of-mind when I talk to customers, everyone, and even us at times, we tend to struggle with this. That’s an area where there’s still both challenge and opportunity out in the wild. As companies still grow and there’s still struggle around how do different areas talk to each other and actually collaborate on things, this is where again, I think the magic of our experience comes to life where we help create a common language across disparate groups. If you are a CTO and you’re responsible for the mean time to acknowledge, mean time to know, mean time to remedy, any of your mean time certain things… At an aggregate level, that’s great, but how are you enabling your teams to actually, positively contribute to that? You got to create a common layer and a common way of doing things so you can standardize and build up the proficiency. Otherwise, you’re going to have too many hotspots in the grand scheme of it.
But that’s still a challenge area, especially as teams want to be able to do it their respective way, from the very top, because their ability to serve their customers is paramount. But getting to the point of the true collaboration, the true working arrangement across disparate teams or silos, if you will. I’m a big believer if you can put automation at the center of that, you give yourself a really fighting chance. You give yourself a high degree of leverage so that you don’t run out of wall clock time too often.
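
To make the "mean time" metrics mentioned in this answer concrete, here is a minimal sketch of computing mean time to acknowledge and mean time to resolve per team, so that an aggregate number can be traced back to the teams contributing to it. The `Incident` fields and the per-team grouping are assumptions chosen for illustration, not a PagerDuty API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps the metrics need."""
    team: str
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mean_times_by_team(incidents: list[Incident]) -> dict[str, dict[str, float]]:
    """Group incidents by team and compute MTTA and MTTR (in minutes) for each."""
    by_team: dict[str, list[Incident]] = {}
    for inc in incidents:
        by_team.setdefault(inc.team, []).append(inc)
    return {
        team: {
            "mtta_min": mean_minutes([i.acknowledged_at - i.triggered_at for i in items]),
            "mttr_min": mean_minutes([i.resolved_at - i.triggered_at for i in items]),
        }
        for team, items in by_team.items()
    }
```

Breaking the numbers out per team is one way to find the "hotspots" Tim mentions; which timestamps count as acknowledged or resolved is a definition each organization has to standardize for itself.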

Mandi Walls: Yeah, yeah, absolutely. You’re speeding up expertise with that automation.

Tim Armandpour: You become somewhat people-independent of that. You still need the people, but if people move within teams within that or they leave your company, you don’t want that knowledge leaving the building, either, every single time. Because it gets expensive to replace it really fast. But those are areas where customers have challenges. Everybody’s got similar challenges. Let’s just keep doing stuff about it.

Mandi Walls: Yeah, awesome. Well, this has been great. Thank you so much for joining us this week.

Tim Armandpour: For sure.

Mandi Walls: Is there any parting nugget that you would leave with our audience out there? Some piece of wisdom maybe that you’ve learned over the past eight plus years of doing this?

Tim Armandpour: Change management, to get really, really good at incident management at large, is really important. I’ve learned to not underestimate both the risk and investment required to make that happen, but the more that fellow leaders and peers can be really convinced of the importance of getting good at this, just like you want to get good at architecture, you want to get good at software development, you want to get good at quality assurance, you want to get good at product design. You got to put the time and effort into it, and I like to always say here at PagerDuty, you’ve got the benefit of 20,000 customers, 14, 15 years in the making. We’re always here to help, even if it’s just to talk shop or point to references we have or other colleagues around the industry to talk through this stuff. Always here to help. We’re in this together. One team, one dream.

Mandi Walls: Awesome. Sweet. Yeah, we’re always here, too. We love to talk to customers. Our team loves reaching out and helping our customers with stuff. Love to hear it.

Tim Armandpour: Thanks for having me, Mandi.

Mandi Walls: We’ll sign off now. We’ll wish everybody out there an uneventful day and we’ll be back in a couple of weeks. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at PageItToTheLimit.com, and you can reach us on Twitter at PageIt2TheLimit using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.

Guests

Tim Armandpour

Hosts

Mandi Walls (she/her)