Runbook Automation With Jake Cohen

Posted on Tuesday, Sep 6, 2022
Embracing automation in distributed systems is key for reaching scale and efficiency. This week we talk to Jake Cohen about runbook automation, what it means for teams, and how it creates opportunities for automated diagnostics.


Mandi Walls: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @lnxchk on Twitter. Hi, folks. Welcome back. This week I have with me, Jake Cohen. Jake, welcome to the show.

Jake Cohen: Thanks for having me.

Mandi Walls: So tell us about yourself. Tell us about what you do at PagerDuty, and a bit about why you’re here today.

Jake Cohen: Sure. So a bit about me. I came from the San Francisco Bay Area, much like many of us at PagerDuty, but ended up down here in Santa Barbara, because that’s where I went to school, at UC Santa Barbara. I spent a few years while in undergrad working in IT, and that led to me working in this industry. I got my first job at LogicMonitor, one of the SaaS-based monitoring companies. They’re based here. And then I knew some of the original Rundeck folks, and after a good tenure at LogicMonitor, roughly six years, they said, “We’re doing really cool stuff,” and I really liked what I was hearing in terms of what they were doing around self-service automation. I could really understand the value props, and having been on the sales engineering side of the org for a while, understanding the value props was the most critical piece to me in terms of where I decided to go next. So that’s what eventually drove me to PagerDuty, was PagerDuty’s acquisition of Rundeck, and I joined just a few months after the acquisition. Now, I’m working on the product management team for the process automation product suite, focusing on solutions and integrations built on top of the process automation platform.

Mandi Walls: Awesome. So our topic today is broadly automation, but more specifically what we’ve come to know as Runbook Automation. In your mind, how would you describe Runbook Automation for anyone who’s thinking about it and maybe isn’t familiar with that kind of automation as a practice and what they can get out of it?

Jake Cohen: The term runbook does come from a very old school idea. I think it actually comes from the mainframe days where you would have literally a run book that was a set of tasks for a given procedure, and it was tasks written on paper. And then as things became digitized, runbooks in IT and operations, and even in engineering, became wikis, and docs, and Confluence articles, and that sort of thing, but it all still was based off of this notion of, in a given situation, there is a known set of steps for a given procedure, whether that’s to restore service, or make an upgrade, or provision access, or provision infrastructure. The very simple concept of Runbook Automation is, well, we have the ability to automate these steps and we can provide tooling to make it clear what each step is in that process, which is different than just, say, a script which automates the whole process end to end. We can actually break up this end-to-end process into these automation primitives. So that’s the basic concept. For us, the OG Rundeck folks, and for us still on the process automation team, it’s more than just that. It involves the ideas of standardizing a lot of the operations around these processes, whether that’s standardizing logging, compliance, access, version control, key storage access. But then there’s also what many who’ve known Rundeck from its older days know it for, which is the self-service nature of it. When we can put a more simplified user interface on top of the invocation aspect of the runbook, then you don’t need to be a domain expert in that particular process or, more so, the underlying technology involved in that process. And so for us, we think about this in terms of there’s the automation piece, which is automating the old school wikis. There’s the standardization, which is that whole compliance area. And then there’s the delegation aspect, which is being able to pass off that automation to someone else in a safe manner.
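To make the three pieces Jake names (automation, standardization, delegation) a little more concrete, here is a minimal Python sketch. It is an illustration only, not Rundeck’s or PagerDuty’s actual API: the `Step` and `Runbook` classes, the role names, and the `restart-service` example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Step:
    """One named, individually visible step (an "automation primitive")."""
    name: str
    action: Callable[[Dict[str, str]], str]


@dataclass
class Runbook:
    """A hypothetical runbook: ordered steps, an access list, and a shared log."""
    name: str
    steps: List[Step]
    allowed_roles: Set[str]
    log: List[str] = field(default_factory=list)

    def run(self, user_role: str, options: Dict[str, str]) -> List[str]:
        # Delegation: the caller only needs permission to invoke the runbook,
        # not expertise in (or credentials for) the underlying systems.
        if user_role not in self.allowed_roles:
            raise PermissionError(f"role {user_role!r} may not run {self.name}")
        results = []
        for step in self.steps:
            output = step.action(options)
            # Standardization: every step is logged the same way.
            self.log.append(f"{self.name}/{step.name}: {output}")
            results.append(output)
        return results


# Illustrative steps only; real steps would call out to real tooling.
restart = Runbook(
    name="restart-service",
    steps=[
        Step("check-status", lambda o: f"{o['service']} is unhealthy"),
        Step("restart", lambda o: f"restarted {o['service']}"),
        Step("verify", lambda o: f"{o['service']} is healthy"),
    ],
    allowed_roles={"sre", "support"},
)

results = restart.run("support", {"service": "web"})
```

The contrast with a single end-to-end script is that each step has a name, is logged uniformly, and sits behind an access check, so someone on a support team could run it without knowing how the restart works underneath.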

Mandi Walls: They are all very important in the composability of that so that lots of teams can use all those things. As you work with folks, what are some common reasons that folks come across this automation? It seems like it’s a layer of tooling that gets brought in at a particular time. Are there places that you see it be more successful than others? Do you have to get to a certain place before you’re ready for Runbook Automation?

Jake Cohen: Yeah. Good question. So the pattern that we have seen is that it tends to get brought in when a team gets above a certain size or where the organization is structured in certain ways. It usually falls into one of the two, I’d say maybe three, categories that I just outlined, where they recognize that they have so much toil. They have so many manual processes that the savings from just automating those are worth it, even if the people controlling the invocation and authoring of those automations are the same people who were doing them manually previously. So it’s usually at scale in that sense, in terms of the number of these toil type activities, or it’s when the teams or organization get to a certain size where there are such dependencies between people, and those dependencies are driven by a delta, typically between a given individual’s or team’s capabilities or access, or even tribal knowledge in the org. In the book Working Backwards, from Amazon, they talked about how they were able to break the dependencies between teams by effectively emulating what we now think of as standard APIs. So they could say a team provided some way for other teams to access resources that that team basically provides as a service. So a very good example here is where engineering needs to be able to perform very specific actions in production environments. And so what the production operations team does is provide these very specific automated runbooks to the engineering team, and it’s like a team-to-team API, if you will. And that typically happens when an organization gets to a certain size and you have that division, a separation of concerns between an operations team and an engineering team.

Mandi Walls: So it can be pretty broad. So you have a lot of potential projects and a lot of things that folks might be looking to implement with this kind of automation. Do you see common hurdles that teams have to overcome to achieve what they’re looking for with Runbook Automation?

Jake Cohen: Yeah. So you may have picked up on this. The number of use cases for this type of automation is very broad, and a lot of that has to do with the nature of our flavor of Runbook Automation, which is a tool that can wrap around other expert-level tools. I believe the most common integration for our customers is using our process automation product on top of Ansible. But then there’s the incident response use cases. I mentioned patching, and so on and so forth. So the most common hurdle I see is customers trying to identify where to start. They have these big ambitious goals of self-service, delegation, and compliance, and standardization across all of those areas that I mentioned, but that’s usually an audacious project, especially for a company of a large size, and so where they struggle is trying to figure out where to start. And so incidentally, that’s been a big area of focus for us, which is steering people towards this use case of automating diagnostics, pulling diagnostic data as part of the incident response process. And there are a variety of reasons for that, but the main reason is that it’s very specific and it’s in a known space of incident response and can demonstrate the core principles of self-service automation and Runbook Automation without needing to say let’s be ambitious and start with an org-wide enterprise layer of automation. Does that make sense?

Mandi Walls: Absolutely. And the automated diagnostic stuff is really interesting. It’s a place where you’re in an incident, there’s something going on, there’s something going wrong and you want to minimize the impact of that, and minimize the duration of that. So let’s dig into that a little bit more. What fits from the Runbook Automation side of it into the automated diagnostic side that’s going to help people with their incident response and managing their availability that way?

Jake Cohen: Yeah. So what’s interesting about automated diagnostics is we deliberately chose that for a couple of reasons. We saw that as the logical avenue for pushing customers as being their first use case for a couple of reasons. And the first is certainly that notion of, well there’s the series of things that people do when they get paged and just begin the classical runbook to Runbook Automation analog. And there are things that people do to identify what is the root cause or to validate if it’s a false positive, or there’s the things that the domain experts do when they get pulled into an incident. And so when we looked at that and we realized, okay, there’s an opportunity here, not just to save time in terms of the duration of the incident, but perhaps even more importantly or even more valuable to the customer is we can reduce the number of people that need to get pulled into an incident by performing the steps that those domain experts do to verify whether or not it’s their domain that is causing the problem or is part of the problem. It’s almost like deductive reasoning, so it’s deducing, is it a database issue? Is it a platform issue? Is it this service or is it that service, and can we therefore provide that first responder with intuitive output of these diagnostic debug level checks that are typically performed by the domain expert in such a way that they can say, “Okay, I don’t need to pull in the MySQL person. I don’t need to pull in the RabbitMQ person.” And so that’s where our head is at with this use case.
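The “deductive reasoning” Jake describes can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not PagerDuty’s implementation: the `run_diagnostics` function and both check functions are hypothetical, and the check results are hard-coded stand-ins for whatever a domain expert would actually query.

```python
from typing import Callable, Dict, List, Tuple

# A check returns (healthy?, human-readable detail for the first responder).
Check = Callable[[], Tuple[bool, str]]


def run_diagnostics(checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run each domain's check and sort domains into ruled out vs. suspect."""
    report: Dict[str, List[str]] = {"ruled_out": [], "suspect": []}
    for domain, check in checks.items():
        healthy, detail = check()
        bucket = "ruled_out" if healthy else "suspect"
        report[bucket].append(f"{domain}: {detail}")
    return report


# Hypothetical stand-ins for the checks a MySQL or RabbitMQ expert would run.
def check_mysql() -> Tuple[bool, str]:
    return True, "replication lag 0s"


def check_rabbitmq() -> Tuple[bool, str]:
    return False, "queue depth 120000"


report = run_diagnostics({"mysql": check_mysql, "rabbitmq": check_rabbitmq})
```

With a report like this in hand, the first responder can see that the database looks fine and only page the person who owns the message queue.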

Mandi Walls: Yeah. And with the additional components in the platform having the authorization, authentication, all those kinds of components there that provides you with separation of duties and all the other things that folks might be worrying about thinking about, “Oh, I have my responder. Do they have the right access for things?” Your automation platform has the access that it needs for all that stuff.

Jake Cohen: Right. So it ties into exactly those fundamental principles of Runbook Automation that we were talking about earlier. And one other thing I’ll mention here, which is one of the other reasons that this was an elegant place for us to start in terms of where we’re swaying customers, especially those who are familiar with PagerDuty and its core value propositions as well, is that what we’re trying to do is get the right people involved in an incident. And I would like to think that Alex Solomon and the other co-founders of PagerDuty, my understanding is that was the basis, which is we don’t need to call in the cavalry every time there’s an incident. We can be more precise with our response and our response plays. And so this is all about that as well, which is, can we reduce the number of response plays, for example, that a user needs to run in PagerDuty or something equivalent if they’re not using PagerDuty, which is, can we inform them that they don’t need to pull in all of these people every time? So it is really core to the PagerDuty fundamental principles.

Mandi Walls: Yeah, definitely. Not every incident has to be all hands on deck. And fortunately, we’re seeing fewer and fewer folks still doing that, but they’re still out there. If you’re listening to this and you’re still doing that, give us a call. We can help you figure all that stuff out so that you’re only getting the people that you need onto an incident.

Jake Cohen: Yeah. It’s one of those big habits die hard, something like that.

Mandi Walls: Yeah. Yeah. Old habits die hard. Yeah. Definitely. Folks are still holding onto that. For folks who are new to all of this, is there an overlap between what they might be using, say a monitoring solution for and what they might be getting out of an automated diagnostics provider on these kinds of things?

Jake Cohen: Yeah, and that’s a question that has come up already with customers, and it’s one that we foresaw, especially myself having come from the monitoring space, and actually two of our other product managers on this team having worked in the monitoring space previously too. And it can be confusing. The way that we try to simplify it down is that monitoring is meant to notify you when something has gone wrong, whether that’s by a rule-based alert threshold or an anomaly detection rule. Observability tools are incredible these days with what they’re able to pick up. Monitoring’s purpose is to let you know that something’s gone wrong. That’s its primary purpose. Its secondary purpose is to help you with root cause analysis, and again, that deductive reasoning. We see automated diagnostics as, regardless of how good your observability is, let’s emulate what the people do when they do get paged, when they do get brought in, because inevitably, regardless of how good your observability might be, you still might then go into that observability tool and look at certain charts, or logs, or graphs, or what have you, or you might perform certain ad hoc queries from the monitoring tool. Some of these monitoring tools provide that now. And so the way I would say it is that a lot of times, monitoring tools don’t necessarily provide the same level of debug diagnostic data that domain experts would go and retrieve during an incident, but even if they do, let’s go and retrieve that data from the monitoring tool and surface that immediately to that first responder, or translate that data from the monitoring tool into a format and put it into content that that first responder can comprehend and do something with.
So it’s easy to see where that overlap is because we might be leveraging monitoring, but ultimately, it has to do with the core purpose of it, which is to, again, emulate what those first responders are doing, regardless of whether or not that data is in the monitoring tool.

Mandi Walls: Yeah, and take it then to the next level of what the humans would do and work off from there. So, interesting. For some of the more complex environments that folks might have, if they’re working in, say Kubernetes or other containerized environments and all those kinds of things, do you find that the principles of the automated diagnostics and Runbook Automation, are things still pretty much the same across those environments for folks that are using those kinds of things?

Jake Cohen: Yeah, that’s a good question too, because again, we saw this in the monitoring space too that things really are different when you are talking about environments that are based off of containers and container orchestration. And the reason for that is that there’s so much self-healing already baked into container orchestration, and the whole premise is a stateful declaration of desired states, and that sort of thing. And so a lot of times, the goal is always service restoration, but with containerized environments, you can restore service so quickly that many times there isn’t necessarily this whole triage operation that you would see with classical infrastructure environments. And so what we have seen with customers is that this notion of diagnostics for triage isn’t as valuable to them as being able to basically capture container state the instant that an incident starts, store it, and then restore service also at instant speed or at incident speed, and then use that captured container state for real root cause analysis. And when I say real root cause analysis, what I mean is there’s root cause analysis to help with restoring service, and then there’s root cause analysis to determine and identify what caused the problem in the first place. And so a good example we have there is performing like a core dump or a Java thread dump of a container, because again, that’s not something that a monitoring tool is typically doing, and unless you have some scheduled task going on, it’s typically not happening. And so our idea there is, well, for these containerized environments, as soon as that incident occurs, let’s go and capture whatever that forensic evidence would be for a developer, allow you to restore service just as quickly, but then provide that debug level data for the engineering teams to figure out, “Okay, was it this one container that caused itself to crash? 
Was it a combination of two or three containers working in cohesion with one another that caused the problem, all of the complexities of container environments?”
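The ordering Jake describes (capture the evidence first, then restore service, then analyze later) can be sketched in Python as below. This is a hedged illustration, not a real container API: `handle_container_incident`, the stub capture functions, the stub restart, and the container name are all hypothetical, and a real version would shell out to whatever dump tooling the environment provides.

```python
from typing import Callable, Dict, List


def handle_container_incident(
    container: str,
    capture_steps: Dict[str, Callable[[str], str]],
    restore: Callable[[str], str],
    store: Dict[str, str],
) -> str:
    """Capture forensic state before restoring, so the evidence is not lost."""
    for name, capture in capture_steps.items():
        key = f"{container}/{name}"
        try:
            store[key] = capture(container)
        except Exception as exc:
            # A failed capture must never block service restoration.
            store[key] = f"capture failed: {exc}"
    return restore(container)


# Stubs standing in for real tooling; they only record the order of events.
events: List[str] = []


def fake_thread_dump(container: str) -> str:
    events.append("thread_dump")
    return f"threads of {container}"


def fake_core_dump(container: str) -> str:
    raise RuntimeError("no space left")


def fake_restart(container: str) -> str:
    events.append("restart")
    return f"{container} restarted"


evidence: Dict[str, str] = {}
status = handle_container_incident(
    "checkout-7f9",
    {"thread_dump": fake_thread_dump, "core_dump": fake_core_dump},
    fake_restart,
    evidence,
)
```

The design choice worth noting is that a failed capture is recorded rather than raised, since in this use case restoring service quickly still comes first and the captured state is for the engineering team’s later analysis.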

Mandi Walls: Right, because sometimes things get spun back up so fast that you don’t have a chance to investigate and see, okay, is this something that’s going to happen again without having that muscle memory to know where to go in and retrieve things and have them available? So, super interesting because that environment can be moving so fast on its own. So, super interesting there. And you actually showed us some of this on the Twitch channel a couple weeks ago. So we’ll add that link in the show notes for folks who want to see how that works, because it’s super interesting to take a look at that stuff.

Jake Cohen: Yeah, that’s right. I’ll go ahead and do that.

Mandi Walls: Yeah. So bringing us back to Runbook Automation as a whole, how do you see this space evolving? We’ve come a long way from very small shell scripts in a directory somewhere. There’s a lot of really comprehensive enterprise-level features in the process automation product and the other things that folks are working on here. What ways are these spaces going to continue to evolve?

Jake Cohen: Yeah. Well, I think ultimately, what I alluded to earlier is this notion of APIs for teams. And so what we’re doing on the incident response side is saying, “Okay. Well, let’s make that dependency between a first responder, whoever that might be, and the domain experts, let’s try to cut that dependency. That direct dependency, if you will.” And what we’re looking at is the different analogous use cases for customer service operations. There are so many times where customer service needs to rely on, whether it’s technical operations or engineering, to perform certain things to help customers. And then there are other departments within the org, whether it’s FP&A or strategy, that just need reporting data from, again, operations, user data for analysis. You might have a data science team that has a dependency on, again, production operations teams. And we’ve seen some customers doing this really well already, but my hope is that the evolution of Runbook Automation is that it becomes this thing that you don’t even see working in the background that allows teams to work where they work today, but that magically things are just happening faster and better for them. So if I’m a customer service representative and I created a ticket to retrieve some information or do something for some customer that typically took a week because it was a ticket in someone’s queue, as soon as I submit that ticket, or as soon as that ticket gets approved, that kicks off this Runbook Automation invisibly to that CS rep and makes their lives easier, makes the customer’s lives easier. And so automated diagnostics is a nice easy way for us to demonstrate that core concept. But I think at large, what we’re trying to do is create that seamless integration, if you will, between front office and back office operations. 
In this case, front office is not so much front office, but more just first responders versus domain experts, but in that customer-facing role, the customer service rep, that is really what we’re talking about there, is that front office, back office.

Mandi Walls: Awesome. So it gives you this opportunity to start encapsulating all of the expertise that all your folks have and present it to everyone else. They don’t have to know all the details. They just have to know which job to run and when to run it, and off it goes, and the magic happens.

Jake Cohen: Exactly.

Mandi Walls: Awesome. So to wrap up today, we have a question that we often ask folks on the podcast and that is to debunk a myth. So is there a favorite myth or common misconception that you might have about Runbook Automation that you want to debunk for folks, things that you hear over and over again that we just want to set the record straight?

Jake Cohen: Yeah. The myth that I would say around Runbook Automation is this misconception that it is similar, or analogous, or identical to a script, or to a function-specific tool or task-specific tool. And again, this is where the difficulty comes. Kudos to our marketing and sales teams for conducting these conversations with the market. But you have these task-specific tools for deployments, for patching, for infrastructure provisioning, and we promote this notion of standardizing that and delegating that through self-service. And I think the confusion stems from the fact that technically, you could build the automation primitives in our tool, but rationally, most of our customers want to wrap around those domain-specific tools. And so I think that the misconception, especially from the more technical users, from engineers, is that, “Why would I need this for me to do what I do? Because I can do all of this using whether it’s Terraform, or Ansible, or a Python script.” And so it’s trying to uplevel that conversation to saying, “No, no, no. It’s not for you to do your specific task better. It’s for your team to, again, whether it’s standardize or delegate those tasks.”

Mandi Walls: Awesome. So a couple other things that we like to know occasionally. So is there anything that you wish you had known sooner when it comes to Runbook Automation? You’ve been working on this stuff for a while. Is there anything that you had to learn the hard way?

Jake Cohen: Yeah, so I got enamored with the value propositions of Runbook Automation. Like I mentioned, the original Rundeck crew and the people that I knew working on that team and that project, and those original value props of standardizing across an organization and self-service automation, and self-service delegation. And I started here at PagerDuty on the solutions consulting side, which is the equivalent of a sales engineer. And I was so hyped on selling those value propositions. And in smaller customers, it was fairly straightforward, I’d say, to hype them up on these value propositions and then get started. But in any organization of, let’s say mid-market size, even small mid on up, it is this struggle of where do we start? You have to know that you have to start somewhere small. It’s really impossible to say, “Okay, we’re going to implement the standardization layer across everything right now.” That’s a behemoth project. That’s near impossible. And so it was recognizing, okay, the way that we’re going to help customers really adopt this is to start with a very specific task or a specific domain area, maybe for just one team, and that will demonstrate the principles to the customer, and we will learn a lot about how to do this their way, and then we’ll move to the next team or the next use case. And then after three to a half dozen of these use cases, the customer and us have learned enough about what this would look like at scale that we can start implementing those at-scale pieces of the puzzle, if you will.

Mandi Walls: Yes. Yeah, definitely. That’s a lesson a lot of us learn, I think, as we go into different customer sites to implement things. And we talk about starting small, working out all the kinks and finding the weird cobwebs in the infrastructure that are maybe in the way, and you get to a point eventually where everybody’s amazed and looking for the next team to work on as people sign up and they’re just waiting to reap the rewards of the things that you’ve already figured out.

Jake Cohen: Exactly.

Mandi Walls: Is there anything else you’d like to share with folks before we sign off today? This has been really good for anybody who’s looking for information about Runbook Automation.

Jake Cohen: What I would say is keep your ears peeled for more on this automated diagnostics theme. As you can tell, I’m definitely an evangelist of it because I think it helps solve this problem of where do you start? I think it’s an underestimated use case for Runbook Automation and we’re going to be releasing assets and content around it, as well as product, more and more productizing the solution to be out of the box for our customers. Keep your ears peeled for more on that. And people can certainly reach out to me if they have ideas or questions. I believe that my LinkedIn and GitHub contact info will be provided through the podcast.

Mandi Walls: That will be on the website. Absolutely. We’ll have some links in there where folks can learn more. And with everything PagerDuty-related, you can follow us at the PagerDuty Twitter for more information. You can follow our blog for more updates and product information and all that good stuff too. So things will be posted there as they get released for folks that are interested in learning more. And this has been great, Jake. So thanks very much for joining us today.

Jake Cohen: No, thank you. This has been great.

Mandi Walls: So thanks for everybody out there listening. This is Mandi, and I wish you an uneventful day. That does it for another installment of Page It To The Limit. We’d like to thank our sponsor, PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes and you can reach us on Twitter @pageit2thelimit, using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.

Show Notes

Additional Resources


Jake Cohen

Jake Cohen (He/Him)

I grew up in the SF Bay Area and then went to UC Santa Barbara for college. My first paid job was working an IT Helpdesk at one of the grad-schools at UC Santa Barbara. I’d always enjoyed working on computers - and this helped me learn the basics of IT. That job also provided the connection to land a position at LogicMonitor (infrastructure monitoring for IT teams) right after graduating. I started on the sales-side, spending most of my time there as a Sales Engineer. I spent a lot of time building custom-solutions for customers as well as helping build internal tools for our teams. After roughly 6 years there, I made the move to PagerDuty to work on the Process Automation products - since I had known some of the original Rundeck team that were part of the acquisition by PagerDuty. After a year on the Solutions Consulting team, I transitioned over to the Product Management team working on our Solutions and Integrations. That brings us to the present!


Mandi Walls

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.