Kat Gaines: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host Kat Gaines, and you can find me on Twitter @strawberryf1eld using the number one for the letter “i”. All righty. Hi folks. Welcome back. So today we are with Iris Carrera, who is a senior site reliability engineer at Dutchie and Iris, if you want to just tell us a little bit about who you are, what you do, the floor is yours.
Iris Carrera: Yeah. Thanks for having me. I’m Iris, senior site reliability engineer at Dutchie. I’ve been working in DevOps infrastructure, site reliability space for over five years and I’m based out of Seattle.
Kat Gaines: Awesome. I love Seattle. I was there just a couple of months ago. It was one of the only trips I’ve taken during the pandemic and definitely the right choice.
Iris Carrera: Nice. Good to hear.
Kat Gaines: Today, folks, we’re going to talk a little bit about being new to incident command, and I think it’s a daunting thing when you’re first getting into it. Iris, I bet you’d agree. You don’t just kind of show up one day knowing all of the things, what to do, who to talk to, how to direct people. And it’s really high stakes when you’re doing incident command. If something happens, you are dealing with things that might have executive or customer eyes on them, and everyone is looking to you to be the source of authority. But I’m going to back up just a little bit and get us started, just for the uninitiated, by talking about what incident command is and what that means. So for anyone who really wants to do a deep, deep dive, PagerDuty has docs detailing all of the pieces of incident response at response.pagerduty.com.
Kat Gaines: We go into depth about all the different roles, including incident commander. So if you want to go read for quite a while, you can do that. But for the short version, for our purposes today, the incident commander, which we’ll also often refer to in shorthand as the IC, is the person who is responsible for directing traffic, basically, when there’s an incident: when something is not going to plan, when multiple teams have to quickly be mobilized and pulled together just to set things right again. Maybe some outward-facing communication needs to happen. Definitely some inward communication needs to happen, and folks need to work together to get things back on track.
Kat Gaines: Everyone has to start somewhere, as I mentioned earlier, you don’t just show up one day knowing how to do the thing perfectly. And sometimes your first experience as an IC can be a little bumpy or a little bit more intense than you anticipated. And Iris, I think you have some experience with that. So tell us about your first week and a little bit how things went down.
Iris Carrera: Yes, just a little background. I started at Dutchie in October and then I joined the incident command rotation. And so my first week on call was as incident command. There was a major AWS outage on the second day I was incident command. So that was a very exciting day. There were also cascading effects on other third-party providers who use AWS. So if you were using third-party providers that relied on AWS, those were probably broken as well. So I do all the prep I can with all the documentation and knowing what tools I need to use and how to use those tools. But there’s nothing like getting thrown into it, and especially for a big bad,
Kat Gaines: Yeah, nothing like trial by fire.
Iris Carrera: Totally, where most things are on fire for a lot of people and a lot on the internet too, apart from the company I worked at.
Kat Gaines: Yeah. So just talk us through in that moment when you realized, okay, my first week as an incident commander and something really big is going down and happening. And like you said, you’d done prep, you’d done everything you could to be ready for the moment, but how did you really jump in when things went down and how did you maybe get support from your peers? Or what did that look like?
Iris Carrera: So part of it is like someone on a software team pages, like, “Hey, there’s a problem with this particular service,” or I join in as incident command and there are some subject matter experts in the room, and then we go over the issue and try to see what the cause of the issue was. Since we were already aware of the ongoing AWS outage, we thought that might be the problem for at least some services. And that was the case. And so for me, I had to keep internal stakeholders and external stakeholders up to date. So updating status pages, updating critical internal comms with what’s going on, what services are impacted, how customers are impacted and things like that.
Iris Carrera: And outages are really unfortunate, but perhaps that type of outage is maybe a good first outage or incident for a new incident commander, because really, you just hurry up and wait. You can’t really do much other than monitor and see when things come back and keep folks up to date, keep customers up to date, keep internal stakeholders up to date. So even though it’s very stressful and a lot of things were broken, in regard to directing people, I didn’t have to do too much of that. I’ve done more of that in incidents after that rotation or that particular day, but it was a good trial by fire in the end. And yeah, my coworkers, my other SRE coworkers, were supportive, seeing if I needed anything and things like that. So I feel like it was a lot of good team effort overall, and that’s just how it goes too, you need to have team effort to resolve any incident.
Kat Gaines: I think it’s interesting what you’re saying about it. It’s about as laid back as an incident can be when it’s not something you can actively work on with your team but you have to just hang around and wait. I think I remember a couple of internet-wide issues from years past where we’d experienced similar. And it was just kind of like, all right, let’s just wait around and see what happens, but everybody’s available to make sure that it is resolved when it is. And we know when things are back up and running. That was interesting. You mentioned that you have had a couple since then, and I’m curious about how you feel like that one prepared you for those, especially if they were maybe on the other side of things, where you did have more of an active role, or what did that look like?
Iris Carrera: Yeah, I definitely think kind of cutting my teeth on a major third-party infrastructure outage just got me into the process. These are the things I need to do. These are the pages I need to, or the Slack channels I need to, keep folks up to date with. These are the type of questions I need to ask and the type of answers I need or that stakeholders want, what is customer impact and things like that. And just trying to get to a resolution as quickly as possible so we can close the incident as quickly as possible.
Iris Carrera: So I think it helped me like smooth out the process. So just the more boring parts of it I could do better, like, “Okay, I need to open this piece of software, type something in there and then type something else in a different document.” So those pieces of it, I was able to smooth out. And so I could be more present in the other parts of incident command. It’s more of like conversation and helping guide folks towards a resolution and keeping conversations on track and stuff like that.
Kat Gaines: That makes sense. It sounds like you’re juggling a lot when you’re working as an incident commander and there are a lot of different moving pieces to make sure are all flowing. And it sounds like every organization does this differently. It sounds like you’re responsible for internal comms. At Dutchie, are you responsible for external comms too, as an incident commander, or is someone else handling that piece?
Iris Carrera: Yeah. So in the past, as just like a regular subject matter expert, I’ve been in on-call rotations where there’s someone who scribes specifically, where their job is to record or keep folks up to date internally. But as it stands at the current company I’m at, in the role we are keeping folks updated and doing a lot of things. At least we don’t have to be the subject matter expert, so that’s great. We just focus on keeping things moving, working towards resolution, letting the subject matter experts shine and do their thing.
Kat Gaines: So you have incident commander, you have subject matter experts. Do you have other roles in your process right now?
Iris Carrera: No, I think that’s pretty much it at the moment. It’s also a process that’s evolving, I think, at my company and it’s going pretty well. Folks are picking it up and getting the process down too as subject matter experts.
Kat Gaines: Cool. So I think that something else that I see sometimes is that you might have folks who are interested in being part of incident response roles or incident command. And they’re not sure if they can get involved, if they want to get involved. What would be your advice to those folks? Maybe even just company agnostic, just in terms of, if you want to get involved, what are the steps one usually takes to try and get trained up or be somebody who can get on rotation and help hold some of that responsibility?
Iris Carrera: Yeah. That’s a really good question. I think first, just on-call experience as a subject matter expert. You need to be able to empathize with a subject matter expert in order to be able to attempt to lead folks. Those are skills you build as someone who is on call as a subject matter expert. That’s my opinion, perhaps folks might have different opinions. That was the case for me at least. And that’s how I feel about it. Having on-call experience prior to being incident command, because you have to know about the things you’re leading. And building good relationships with engineers across your organization can be helpful. That might be helpful while you’re incident command and there is something you need to be able to do. Having good rapport with folks, being able to be a good communicator in times of crisis is an important skill to have. Trying to think of ways folks build that.
Iris Carrera: Let’s see, for me, I played rugby for a long time and the position I played was scrum half, which is like football quarterback. So it’s like a directive role, you’re calling plays. So that’s some area that I draw correlation from, being able to have those skills. I’ve honed those skills in a different area in my life. So that said, if there are other areas outside of software where you feel like you can hone those skills, they’re totally applicable because it’s all people at the end of the day.
Kat Gaines: Yeah. I think that’s really important because I think that people do tend to, when they’re psyching themselves out about what do I want to do or is there a way that I can sign up to do this thing that I’m excited about. Often you’ll disqualify yourself and say like, “I don’t have in-depth knowledge about that subject or I’m not a subject matter expert on all of the things. So I couldn’t possibly do it.” But what you’re talking about is the human side of incident command too, right?
Iris Carrera: Absolutely.
Kat Gaines: The empathy side, the relationship building, like you said, the communication skills and exactly as you’re saying that can come from so many different places in your life that when you realize you can pull from those experiences, that is so valuable and it can really help in the moment, right?
Iris Carrera: Definitely, yeah. And at the end of the day, I feel like site reliability engineering and incident command it’s people, it’s relationships, it’s psychological safety during an incident, letting the subject matter experts do what they do best. You don’t have to know the product in depth that you’re running the incident for, but you have to be able to ask the right questions and just make sure things stay on track and make tough decisions when needed. That’s part of the job as well.
Kat Gaines: Yeah. It’s like you were saying earlier, allowing that room for the subject matter experts to shine. And a lot of that is about asking those correct questions and building that psychological safety too, so that if someone needs to bring up something that they think needs to be done or that needs to be considered as part of the incident response, it’s an okay place for them to do so, they’re not doing the same thing and disqualifying their own knowledge, even if they have something really valuable to add, right?
Iris Carrera: Absolutely.
Kat Gaines: Yeah. Something I’m curious about is, is there a myth or common misconception about incident response that you want to debunk and tell people to just set that aside because it’s not true?
Iris Carrera: Sure. In the vein of some things we’ve been discussing so far, you do not need to be the subject matter expert of the systems you’re leading incident response for. And again, as someone in a leadership role, the job of leaders is to let people do what they do best as uninhibited as possible and then just making sure folks stay on track. You’re not really like throwing your weight around and being like, “I’m the boss. I’m in charge.” And when I say that, sometimes you have to be very firm about decisions when there are disagreements and it’s stressful, and just try to stay calm. So mostly things about leadership, and you don’t have to be a subject matter expert.
Kat Gaines: Yeah. That makes sense. You have to be the person who is the calm, even if everyone else is feeling a little off kilter and then you can go back later and go, “Oh my God, that was insane.”
Iris Carrera: Absolutely. You could vent to somebody else, not in the moment. We’re all human, we all have a wide range of emotions. Having maybe like a confidant or someone you can vent to is also helpful.
Kat Gaines: Yeah. Or just a team vent session sometimes. I found those helpful sometimes after major incidents where everyone just goes, “Okay, that was insane. Let’s just talk about it for a moment and then move on.” In our last few minutes here, two things that we ask every guest on the show. So one thing that I would love to ask you is, what’s something that you wish you would’ve known sooner when it comes to incident command and especially in the topic of being new to it, whether it’s completely or at a new organization.
Iris Carrera: It is challenging. However, it is less challenging than I expected because I don’t have to be the subject matter expert. I’m just guiding folks and getting info mostly to the right place and making decisions like tiebreakers or whatnot. But I feel like it’s easier than being a subject matter expert. Because I don’t have to do the bug fix or things like that under pressure.
Kat Gaines: That can help with the mental hurdle too. Knowing that it’s less challenging than you think it’s going to be. And then the second one I’d like to ask is, is there anything about being new to incident command, being an incident commander that either … I’ll put this two ways, either that you’re glad we didn’t ask about or that we didn’t ask about and we really should talk about?
Iris Carrera: Yeah. I guess something that would be good to talk about or question is like around decompressing after being on call or taking breaks. So sometimes there’s a long-running incident and you need to go eat lunch and you’ve been on that phone call for hours and you have other people you can tap on. You are a person, not a superhero. You can ask for help when you need it too, and rest is important.
Kat Gaines: It’s so important. I think that we all try to be the hero but if you are completely burnt out or exhausted, you’re not going to be helping anybody. And that’s why we have teams around us so that we have people who can take the next shift, people we can pass it off to and give enough context to understand what’s going on, and go have that time. Like you’re saying, to eat lunch, to just be a person, if it’s an especially long-running one, just to get a break from the screen for a while, if you need to, because those can be beasts, and make sure that you’re still able to bring the best version of you to what’s going on. Because if you can’t, it’s going to be 10 times worse than if you just weren’t there.
Iris Carrera: Absolutely. I agree 110%. I’ve had times where I’ve been like, “Hey, this is going on for a long time, I need to step away for half an hour and have my coworkers cover me.” And that makes a world of difference. If I didn’t get a break, I don’t think I could have made the right types of decisions I needed to as incident commander. So it’s important to have a team around you.
Kat Gaines: And again, it can be high stakes. So if you are exhausted or burnt out and you’re making mistakes, it’s going to be maybe more visible than when you’re doing something in your day to day work. And I think that’s something that we all need to remind ourselves of is that, it’s okay to take the time to be yourself, especially when the consequences of not doing so could be a lot bigger than they are on any other day. Very very last question. Just before we let you go, Iris, it’s been really fun having you on the show and chatting. Is there anything that you want to plug? Anything that folks should go check out after the show?
Iris Carrera: Dutchie is hiring engineers and other roles. So check them out. If you’re interested in working in the cannabis tech space, it’s a great company to work for, check them out.
Kat Gaines: We will drop a link in the show notes so that people who want to go check that out can have an easy way to go find it. Thanks again, Iris. Thanks for chatting with us today.
Iris Carrera: Thank you as well. This was fun.
Kat Gaines: Thanks all. So this is Kat Gaines signing off and wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making the podcast possible. Remember to subscribe in your favorite podcatcher if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit using the number two. Thank you so much for joining us. And remember, uneventful days are beautiful days.
Iris Carrera is a Senior Site Reliability Engineer at Dutchie focused on observability and incident response. Prior to Dutchie she worked at HashiCorp building the infrastructure that supports HashiCorp Cloud Platform. Iris has worked on infrastructure and site reliability in aerospace, cannabis, and cloud PaaS environments. Iris lives in Seattle, WA, with her partner and pup.
Kat is a developer advocate at PagerDuty. She enjoys talking and thinking about incident response, customer support, and automating the creation of a delightful end-user and employee experience. She previously ran Global Customer Support at PagerDuty, and as a result it’s hard to get her to stop talking about the potential career paths for tech support professionals. In her spare time, Kat is a mediocre plant parent and a slightly less mediocre pet parent to a rabbit named Lupin.