Incident Response to Incident Management With Jeli

Posted on Tuesday, Mar 5, 2024
In this episode, we welcome Nora Jones, Founder and CEO of Jeli, which PagerDuty acquired in 2023. We talk with Nora about expanding incident response into incident management and learning from incidents to improve reliability.

Transcript

Mandi Walls: Welcome to Page It to The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve the system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me at LNXCHK on Twitter.

Alright, welcome back, folks, to Page It to the Limit. This week I have with me Nora Jones, founder of Jeli, and if you have missed the news, PagerDuty did acquire Jeli late last year, so we’re very excited to have the Jeli team on board with us at PagerDuty. It’s been very exciting. We’re really looking forward to bringing all of their stuff into the PagerDuty platform for all of you out there. In the meantime though, Nora, introduce yourself to the people. What’s interesting about you?

Nora Jones: Thanks Mandi. Yeah, hi everyone. I was the founder and CEO of jeli.io. Like Mandi mentioned, we got acquired by PagerDuty late last year and I’m now a senior director of product at PagerDuty really focused on post incident reviews and some of the collaboration experiences you have inside and outside of incidents.

Mandi Walls: How did you get into that? It’s such an interesting niche. I feel like a lot of people sort of struggle with that. We’ve had folks on the podcast in the past who talk about blameless postmortems and all that kind of stuff, but folks still struggle with this. How did you get into this sort of thing?

Nora Jones: Yeah, I mean I’ve been interested in risk my whole life really. I had a degree in computer engineering and I immediately joined a home automation and security company, but I was focused on the risk side, which the risk side of your home automation and security system going down is pretty major and that was how I started my career. And so the implications of system failure were quite large and so I was really taking it very seriously from the beginning. But one thing I noticed was how acutely focused everyone at my organization was on the technical side of how things went wrong rather than some of the ways that led up to it. And so it was something I noticed, but I was not sure totally how to fix it and so I started doing some research on my own. I started trying to invite folks from other departments to incident reviews to talk to them about situations and it sort of evolved from there.

So in every job I was in after that I was involved in incidents in some way, shape or form, and I was always trying to understand the human side of it and how the way the organization was structured might’ve contributed to the incident as well and really trying to use that to improve and improve everyone’s working environment. I mean, we go to work, it’s a place where we spend a lot of our time. I think we all kind of want to do a good job at our work too, but we’re all very different humans and so I was really interested in finding a way to help humans work better together despite the differences and the way they live their days and help them use that understanding to make them more productive and feel like they’re doing great work.

Mandi Walls: Yeah. What kind of things do you see with folks? When we talk about some things that happen during incidents, I talk about how the organizational structure sort of comes through. Sometimes we talk to folks about executive swoop. Executives kind of want to be helpful, but they’re kind of distracting and you really wish they got off the call. What other kinds of things do you see during incidents?

Nora Jones: It’s always really interesting. I’ve been at a lot of large organizations and there have been times where I’m meeting someone for the first time in the middle of an incident and having to work with them and also having to direct them on what to do. And so there’s not really a sense of trust that we’re able to build yet. Building that into the incident process, so people feel comfortable executing on things, and building that trust so quickly to stop the bleeding is a really difficult thing to do, but it’s something organizations have to be very intentional about outside of incidents in order to be successful at it.

Mandi Walls: Like you say, to be intentional. Are there programs and trainings for things like that or is it something that you grow like a muscle as you practice?

Nora Jones: I think it has a lot to do with the psychological safety that is present in an org to begin with, just how people talk to each other and engage with each other in everyday work, and then how you bring that culture to an incident. Because in an incident, you’re not dancing around things. You have to be very, very direct and very authoritative, more so than you might be with that person outside of the incident too. Part of the training is just letting people know: hey, when you’re in this situation, you have your incident hat on, and here’s how we talk, here are things that are not useful to say during an incident, and here are ways that are not useful to interact during incidents. It’s really giving people some guidelines to follow when they’re in the midst of an incident so that they know what to expect.

Mandi Walls: And I was a sysadmin for a long time and a lot of folks just really don’t have the compartmentalization to deal with incidents really well. They freak out, they just kind of close down. They can’t really concentrate on what they’re doing, and that just compounds the interpersonal issues that they might have with someone who’s trying to get them to help with something. It doesn’t help the whole situation.

Nora Jones: I think folks have to be really comfortable, and the org itself has to really practice what they preach: hey, you’re going to get really direct attention and wording aimed at you, and it doesn’t mean you’re in trouble. And the person that is giving that direct attention and wording also has to make sure that they’re not thinking that this person did something wrong. And so you have to make sure you’re baking in assuming good intent on both sides. If there’s this subtle layer of passive aggression, it’s not going to work out well either. But I think you’re absolutely right. I think there’s some people that have a really, really hard time with this, and I think when you think about incident reviews and then when you think about incidents, they are very different hats to wear, right? When you’re in an incident review, you want to be methodical, you want to be inquisitive, you want to be an investigator.

When you’re in an incident, you want to be direct, you want to be authoritative. And I’ve seen people that are very good at one and not good at the other and vice versa, really great incident commanders that do not do well in an incident review because they’re so authoritative in the incident review as well, which is not what you necessarily want in that particular moment. And so it does take a skillset. I really like what you’re bringing up there. It takes a skillset that you do need to train and really invest in order for it to go smoothly.

Mandi Walls: Yeah, definitely. And we’ve kind of talked about it in some past episodes. I had John Allspaw on, but it’s been three, it’s been forever, an internet lifetime, since we’ve talked to John. But along the same lines, the state of practice of incident reviews and learning from incidents has, I think, improved rapidly over the past several years. It’s had a lot of attention on it. People are now thinking about it, I think, in a much more, I would say, professional way. Maybe they’re not all methodical, but they’re actually taking it as something they need to be doing, which I think maybe they weren’t for a while. When you’re working with incidents, working with folks who are learning this process, what are some of the things that folks should focus on first? If they feel like they have a pretty psychologically safe organization, and that feels like level set for anyone doing any work these days, what else needs to be there to have a really productive discussion for your post-incident review?

Nora Jones: And I just want to go back to what you said earlier. Yeah, we have dramatically improved as an industry in the past few years, and I also think we’re very young as an industry. But yeah, when we were starting out Jeli, I think there was a notion very commonly held in the industry: we don’t have time for post-incident reviews. And it’s like, of course, you don’t have time for the incident either and it still happened, so let’s acknowledge it happened and try to squeeze something that we can from it. But I think first and foremost is dedicating time for the post-incident review and not breaking your promise to learn. If there’s another standing important meeting that’s at the same time as the post-incident review, make time to try to figure that out so that you can really attend and be there. Because a lot of the value that comes out of the incident reviews too is not just someone investigating what happened and writing it down.

It’s the block of time where people are talking to each other and engaging with each other about what happened and about what their point of view was at that particular moment. And it also really takes a skilled facilitator to provide that kind of collaborative environment and really make sure they have a good agenda that they want to touch on going into the meeting. Being able to get people off of their soap boxes if they’re talking for 30 minutes, all of that takes a lot of skill and focus as a facilitator to really extract all those data points out of all those people. Because the really interesting thing about the review meeting is it’s not like the incident review gets done before the meeting. The meeting itself is another source of data into your incident review. The review should never be done until the review meeting happens.

Mandi Walls: Absolutely. We find lots of stuff that either gets remembered or we add another lens or a little bit of additional context when we’re talking through it in the review meeting, even though everybody had an opportunity to go through the document, the template beforehand and add their whole story and we have the timeline and there’s all these things and there’s always something that comes in like, oh yes, we were thinking about this as well that comes in. That adds sort of another dimension to at least some part of it.

Nora Jones: Absolutely. And one thing that I think is very common is people are so busy, and they’re usually doing the things that they were doing before the incident took place, and they’re scrambling. So sometimes when they get to the incident review, they actually forget what has happened. And so I think a really skilled facilitator will also be very good at reorienting people to the situation so that they can have a productive meeting, and really jogging their memory. And I think you can really do that with visuals. It’s going to be hard if you just have a Confluence doc up on a screen and you’re following a what-went-well, what-went-poorly format, which is a very common way to do this, but it doesn’t always jog the memory of here’s what happened and really jump people into that situation. And I think that’s incredibly important to do.

Mandi Walls: And there’s such an investment, like you said about don’t break your promise to have the meeting. We have a timeline after an incident for when you’re going to handle the review meeting: it needs to be within five business days so things don’t fall out of your brain. They probably have gone anyway, even after five days. But then what’s the life of that review afterward? What should folks be thinking about doing with the artifacts that they produce in those meetings?

Nora Jones: I think they should treat these artifacts as if someone’s going to read them six months from now, a year from now. I think if you invest a good amount of time in them, they can be one of the best onboarding guides that you can have for new employees. They can be a really great source of information on how people and systems work and interact with each other. And I think just really taking the time to document what a thing is. If this particular piece of technology fell over, instead of only explaining why it fell over and how it fell over, take a good chunk of that document to explain what that technology even is. That is such a good use of an incident review, and I don’t think people always see it like that. They almost use the incident review as a way to explain away the situation, when that doesn’t really serve longevity in the org.

Taking time to explain how something works and how it’s supposed to work well and then how it broke can really be this nice educational read. I think one of my favorite moments of my career was when I was in the office at Slack and I saw someone curled up on the couch in the office reading one of my printed-out incident reviews. I was like, this is fantastic. They were taking time out of their day to just flip through it. And I think when you invest a lot of time in it and really make it narrative and educational rather than explaining why, it kind of changes how people engage with it and the value you can get from it.

Mandi Walls: Who do you think of as the audience for those then? I kind of think of them primarily for the team that’s creating them, but it’s probably everybody else too, I guess.

Nora Jones: Yeah, I think this is part of the issue. People have a hard time doing a good job at these because they’re thinking of so many audiences at the same time. We’ve run a workshop several times at Jeli for various customers, and in the beginning of the workshop we show this publicly facing RCA. We ask people to read the whole thing, and then one of the very first questions we ask is who they think it’s for, and everyone has a bunch of different answers. And it’s because it’s not totally clear who it’s for. Is it for the customer? Is it for the colleagues? Is it for the board? Is it for your coworkers? And I think you have to think about each of those people when you’re writing it. And I think for the internal ones, you should think of your colleagues as the audience. You should think that someone from CS might read this.

Are they going to be able to understand it? You should think that someone from leadership might read this. What do they need to understand? You should think that an engineer might read this. What are they going to have to understand? And so I think what it can really do is if you’re thinking of all those colleagues as part of your audience, it can really increase the understanding that they have of each other’s worlds. But then there’s the publicly facing ones too, and then there’s the ones that your execs have to give to their boards. And so I think you can use this internal one to guide some of those, but the time pressure between when you need to deliver some of those and when you can deliver this is sometimes at odds as well. And so I think a good facilitator and writer can really work with that and understand that there’s going to be different layers that need to understand it.

Mandi Walls: I think one of the benefits of this whole mechanism becoming a little bit more mature and a little bit more accepted across the industry is that I think customer support people are a bit more conscious of, yeah, we want to get this out. We want to get something that makes sense for the customer. They understand the impacts and it’s in their sort of language. But also on the business side, CorpCom people I think are more attuned to doing a better job at this as well. Getting in there and, like I say, talking to the board, maybe if you have to talk to press, and having those folks in there with the right language at the right time.

Nora Jones: Totally. And having the person that’s writing the internal review, also interacting with that person and really learning from them. It brings up a really great point, Mandi, when you understand it, you then also have to communicate outwards and you have to use language that provides confidence but is also transparent at the same time, and it’s a really tricky balance to strike.

Mandi Walls: Yeah. Do you find these are harder now that we’re working in more distributed environments, or is that just the additional complexity that was going to be there anyway?

Nora Jones: It’s a good question. They’re different. I think one of the good things about being in remote environments is that all your resolution of the incident tends to be recorded. Whether it’s recorded in Slack or in Teams or in a Google Meet or in a Zoom, you have access to how people were speaking to each other, when they were taking pauses, what they were looking at in various situations. And so you can use all that data to actually understand the coordinative parts of how an incident unfolded, which can really be used to point to the technical parts that might need fixing.

Mandi Walls: It feels cool to say we were in the command room or whatever during the incident, but you don’t necessarily have the best high-fidelity recording of what happened if you’ve got 40 people in a room trying to solve the issue versus 40 people in Zoom and chat at the same time.

Nora Jones: Totally. And so I think there’s a lot of benefit to improving our incident reviews because of that. There’s always benefit. I think if we were in the office together and something was on fire, I would probably run up to your desk, you know? We wouldn’t necessarily jump in a Zoom or a Slack room, because it would be faster and I would be showing you on my computer what I was seeing. And so I think we’re missing some of that high-touch resolution of incidents. But I think what is a big benefit to the remote world is the ability to train people and to really improve your organization outside of the incident as well, because everything is recorded.

Mandi Walls: Sometimes you miss those days, but other times it’s like, oh no, we have this full story and no one has to sit and consciously try and remember every bit of minutiae that happened while we were at a conference table and things were on fire.

Nora Jones: Totally. Yeah. There’s something really exciting about resolving an incident in office, even though it’s stressful, but I totally feel you. I remember the Netflix command center, it was like this conference room with just monitors along every single wall, and it was just like we would just hunker down in there until a thing was resolved.

Mandi Walls: Well, there too, you miss a celebration at the end. The Zoom ends and the chat ends and somebody gets to sign the incident review. But if you were all in the conference room at the end, somebody’s manager is ordering pizza and you’re going to have a little bit of downtime and recovery time. That is kind of nice.

Nora Jones: It’s kind of nice, but you also get to be in the comfort of your own home now in the middle of an incident. So if you’re going to be working late and stuff, at least you’re already at home rather than having to eat dinner in the office.

Mandi Walls: True enough. Don’t miss those days for sure. A hundred percent. Other things that you’ve been working on: in addition to Jeli, you also had Learning From Incidents, which is learningfromincidents.io, and that’s another whole community around all this stuff. Talk about that a little bit. What’s there for folks, and is it continuing now that you’re with us?

Nora Jones: All great questions. And one of the things you said at the beginning was how much the industry has taken this more seriously in the last several years. And I think no small part of that is due to the people in that community, because they’re across all these different companies really doing grassroots initiatives in their companies to get them to think about incidents differently in the post-incident review. I started the community several years ago, when I was still at Netflix. I was trying to do that kind of work at Netflix and I wanted to talk to someone about it that didn’t work at Netflix. I wanted to see if anyone was doing this at another company and what some of the change management aspects were, because there’s a huge technical side to doing incident reviews better, but then there’s this whole other change management side to doing it better.

And if you’re one person doing both of those things at the same time, it’s a lot of work and it’s really difficult to do both of those things well. So I was like, is there anyone else doing this or caring about this? And I shot out a tweet and that was how the community started. I started a Slack community after that tweet and it just grew bigger than I ever imagined it would. We have hundreds of people in it chatting about these grassroots initiatives. And we try to keep the community small so that people feel comfortable sharing what they’re going through in their organizations and getting advice from each other. We try to keep it small enough that people know each other and there’s names and faces associated with it. And last year we ran the first Learning From Incidents conference to allow folks that were not in the Slack community to also participate. I started it before Jeli, so I think I’ll always be passionate about it and investing in it in some way, but I think communities also kind of take a form of their own, and I think it really has, and I think that’s really awesome.

Mandi Walls: Yeah, it’s definitely cool. As folks join and you get to know them better, where does this function sort of sit in an organization? For ours, it falls to SRE mostly. Is that sort of where it lives for folks that are thinking, oh, we should pick up a little bit more of this, we should mature our process a little bit? Whose responsibility does it become?

Nora Jones: I think one of the coolest parts of building Jeli, from an incident nerd perspective, was just getting to see how all of these companies did it across the industry, and a lot of them do it very differently. And I think the right answer is what works for your org. I think you want to have people in charge of this program that are able to be influential in the organization. They hold a certain level of respect from their colleagues. People want to go to meetings that they’re running. People feel these people are technical and they can be very transparent in the details with them, but they also feel like they’re folks that try to gain trust with people and they’re social. And so I think those are some of the most important attributes of someone running a program like this, rather than needing to be an SRE or a program manager.

Mandi Walls: Absolutely. And I’ve seen other folks that compare post-incident reviews to an agile retrospective. Some of those skills also cross over, and some of those folks might find interest or a little bit of enthusiasm for that as well. There seems like a little bit of cross-training that happens there as well.

Nora Jones: And I think there’s a really hidden benefit of investing a lot in this work as an individual. I think it’s best if it’s in addition to your job rather than your whole job, because you learn so much about the system that it kind of uplevels your career in that organization. You suddenly become the person that people go to because you know so much; you’re like this wealth of knowledge. And so I don’t think people always realize that when they’re signing up to do this or being told to do this. But I’ve seen someone who was struggling for years to get promoted from senior engineer to staff engineer. And after investing a lot of time in facilitating incident reviews, they got promoted, not because they were running incident reviews, but because they suddenly knew a bunch of stuff and they were like the guy. People were like, oh, how do I do this thing? And they would know the answer to it. And so I think if you have a bunch of people doing that in an org rather than one person, you’re going to kind of uplevel your entire org. Everything’s going to get more efficient, smoother, more collaborative. And so I think there’s just a big hidden efficiency and productivity benefit that I really wish all executives and leaders knew about, because I think it’s a huge competitive advantage.

Mandi Walls: When else do you get an excuse or a legitimate reason to sit and do a deep dive on some part of your system? Even if this is stuff that you wrote, you’ve got deadlines, you’ve got feature requests, you’ve got all these things going on, things have to go through the pipeline, and you don’t often get a chance to sit back and take a day or a couple of days to go through all the minutiae, all the weird bells and whistles, and dig around and all that stuff. That just doesn’t happen. So yeah, it’s a crazy opportunity.

Nora Jones: It is, yeah, it’s a really great opportunity to just really just help productivity in your org.

Mandi Walls: Yeah, definitely. So one thing I would like to ask folks on the show is are there any myths you’d like to bust or any pet peeves you might have when you’re talking to folks about this kind of work?

Nora Jones: I think my biggest pet peeve is, especially in this remote world, if they’re in the middle of an incident, they’re not posting necessarily what they’re looking at to come to a conclusion. So I think my biggest pet peeve is when someone just appears very knowledgeable in an incident and you’re like, how did they do that? But they’re not necessarily writing the steps out of how they did that. I don’t know if it’s really a pet peeve more than just a, I think we would all be better for it if we kind of knew the nitty gritty details of what they were looking at. How did you get to that graph? Did you make this query? Understanding things like that I think would help create more strong engineers in the midst of an incident.

Mandi Walls: Absolutely. Show your working. What are you doing? What are you digging in? Whatcha...

Nora Jones: ...digging at? Yeah. Yeah.

Mandi Walls: I know. I’ve been guilty of that myself during things. Sometimes you just get so deep into the tactics, you forget that you need to tell other people what you’re doing in case they can either add some additional context or flag you off if you’re headed in the wrong direction for stuff.

Nora Jones: Totally. Yeah. And just showing how you got there. I mean, I think what I did in new jobs, if I was in an office, was just walk up to the person’s desk and ask them if I could watch what they were doing. Yeah, because you learn so much rather than just the output. You see the process of how they get to the output.

Mandi Walls: Absolutely. It starts out, it kind of looks like The Matrix, but eventually you get to pick up what they’re working on and where they’re headed and how they’re thinking. It’s great. Oh, and folks that just don’t want to do this. I’m like, I get it. It’s tiring. It’s exhausting. And sometimes, I feel like, we get into waves where a lot of things go cattywampus for a while and you kind of get stuck and people get exhausted, and I totally get it. I feel like you want to approach every incident, at least at first, in a bubble without context of any other incident. Just come in and say, here’s what happened this time, and then link them together later. And folks get a lot of fatigue, I feel like, if they’ve got a lot of things they feel like are connected, even if they haven’t done the investigation yet to know that they are.

Nora Jones: Yeah, I feel that.

Mandi Walls: I mean, I don’t recommend anyone ever have a seven figure outage.

Nora Jones: Don’t wish that on my worst enemy. No.

Mandi Walls: Not even a little bit. Not at all. It’s just not the life to live. So as folks get better at all of this, it’s really nice to see folks improving this practice internally here at PagerDuty, but also all of our customers that are working on this and getting better. As a customer of many of them, I appreciate it.

Nora Jones: Yeah, a hundred percent.

Mandi Walls: Makes it so much nicer.

Nora Jones: Yeah, it’s like we’re all improving as an industry together. Yes.

Mandi Walls: Absolutely. So where can folks find you these days? Where are you on the social medias?

Nora Jones: I’ve been posting on LinkedIn more, so you can find me there. And then I also still occasionally post on whatever it’s called now, Twitter, X. You can find me there too, and that’s @nora_js. I made the username when I was learning Node.js and I thought it was funny and then never changed it. And so I think I have a lot of random front-end people following.

Mandi Walls: Totally understand. My handle is still Linuxchick after 25 years, so those days are mostly in the past. But, well, this has been great. To close us off, what are you looking forward to now that Jeli is part of PagerDuty? Is there anything you can tease out for us that might be coming or things that you’re looking forward to sharing with the PagerDuty community?

Nora Jones: There’s so much. I think what I’m really looking forward to is just the connection pieces between Jeli and PagerDuty and really completing that full feedback loop. I am just so excited about the data that we have access to now, because I think with more data we can help drive more improvements for customers. So I think customers can really expect to see more investment in learning from their incidents.

Mandi Walls: Looking forward to it. All right, Nora, thank you so much for joining us. Thank you out there for listening. We’ll be back in a couple of weeks with another episode. In the meantime, we’ll wish you an uneventful day.

Mandi Walls: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter at @pageit2thelimit, using the number two. Thank you so much for joining us, and remember: uneventful days are beautiful days.

Show Notes

Additional Resources

Guests

Nora Jones

Nora Jones is a Senior Director of Product at PagerDuty, and the former founder and CEO of Jeli. She is a software engineer and leader with 10+ years of experience at innovative companies including Netflix and Slack. Nora’s focus on the sociotechnical aspects of engineering — the intersection between how people and software work together in practice in distributed systems — is a founding pillar of Jeli, as well as the Chaos Engineering movement, which Nora helped build from the start. She is also the founder of the Learning From Incidents community (learningfromincidents.io).

Hosts

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.