Sustainable On-Call Culture With Paige Cruz

Posted on Tuesday, Apr 4, 2023
In this episode, PagerDuty talks to PaigerDuty! Paige Cruz from Chronosphere joins us to discuss what a sustainable on-call culture really means, maintaining healthy on-call hygiene to avoid burning out the humans who have to respond to alerts, and more.

Transcript

Kat Gaines: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Kat Gaines, and you can find me on Twitter @StrawberryF1eld, using the number one for the letter “I.” Okay. Hi folks. Welcome back to Page It to the Limit. Again, I’m Kat Gaines. Today, we are here to talk about building and iterating upon a sustainable on-call culture. This is a topic that I’m really excited about; I’ve been part of, and seen others be part of, multiple attempts at figuring out what sustainability looks like in an on-call culture. I think it’s really interesting just to observe and talk to others about what they’re doing. To do that today, we are joined by Paige Cruz from Chronosphere. Paige, why don’t you go ahead and tell our audience a little bit about yourself.

Paige Cruz: Hi everyone. I am a recently retired site reliability engineer, we’ll get into that a little bit later, and current developer advocate at Chronosphere. What really gets me up in the morning is thinking about the humans in the system and how we can support them. My particular passions are observability and especially open source instrumentation. Shout out to Prometheus and OpenTelemetry. Thanks to Chronosphere, my time here today can focus on how to craft these sustainable on-call cultures, and just like you mentioned, Kat, it’s iterative. We make attempts at first. The journey is ever-growing.

Kat Gaines: It is. I like that recently retired label too. I didn’t think about doing that when I myself switched over to being a developer advocate, so I might steal it…

Paige Cruz: Borrow it. It’s yours.

Kat Gaines: It’s like a year and change retired-ish now from my previous career, but it’s fine. For our audience, let’s level set and talk about what sustainable operations really means in this context. I think it’d be great, too, to talk a little bit about why organizations should really be investing in this and improving their on-call culture this year and going forward.

Paige Cruz: Absolutely. I think the first thing that comes to a lot of folks’ minds when they hear sustainability is eco-friendliness and the impact on the environment. Making sure that we are good stewards of our compute resources, the amount of bits and bytes we send across the wire and across networks, is absolutely a component of sustainable operations. To me, though, the piece that we need to focus on as an industry is the humans. We do put a lot of humans on call with pagers, responding to sometimes false alerts and sometimes very, very serious alerts. When I think about the human resource here, sustainable operations is about how we sustain a group of confident and trained on-call operators in a world where we’ve just seen, what is it, the great resignation and quiet quitting, and people joke about software engineers having a two-year tenure at companies. Well, the sustainability of building up that institutional knowledge, knowing what you’re deploying, how you deploy it, the topology of your infrastructure, that stuff takes time. Really, this term came into prominence for me after I tweeted a throwaway tweet: “a three-person rotation is a bigger risk to the system than only being in one AZ or availability zone.” Just as we should not accept single points of failure in our architecture, we shouldn’t accept them in our on-call rotations either. In a nutshell, to me, sustainable operations is not burning out your human operators in order to maintain a functioning technical platform or product.

Kat Gaines: Yeah, I mean, I agree. I thought it was interesting, your point that people usually think about sustainability in terms of the environment, and this is just a different environment that you’re investing in the sustainability of. I think it’s also about getting back to the definition of what sustainability really is, which is making sure that there is an operable set of standards and practices going forward, not just for the moment, or for the team composition as it is right now, or for what the team owns as it is right now, but that however things change, your setup and your process can shift and move with that without burning anyone out, like you’re saying, making sure that those humans are still protected at the end of everything, right?

Paige Cruz: Absolutely. I’ll add to that what I think gets missed for sustainable operations: look at an average tech company over the last 24 months, say. We have seen as an industry lots of layoffs. You’re losing key people who understood your systems. We see shrinking capital, which I interpret as you really need to start justifying your tool spend, and maybe your monitoring bill, X, Y, Z. There’s a lot more scrutiny on the financial aspect. Then there’s also just that normal attrition of folks like me retiring and saying, whoa, this has not been sustainable and I can’t continue to sustain my on-call efforts here. When you think about that in macro, let alone reorgs where maybe you’ve shuffled teams and services around but haven’t updated your rotations, oh my gosh, that’s a recipe for burnout.

Kat Gaines: Yeah, it is. Especially when it’s really hard to manage those things or get visibility into them, even when everyone’s trying not to burn their people out and stay ahead of it. It can be hard when you don’t really know what’s going on or where things stand in the rotation. I think that’s a good segue into a question I’m interested in, which is: what are the red flags for an on-call culture that might not be sustainable? What are the types of things we typically see that could tip you off to saying, oh, this probably needs some work?

Paige Cruz: Yeah, absolutely. The most serious one is a bit of a lagging indicator, which is attrition. Depending on how much trust your HR org has with those departing, you may not even find out in an exit interview that it was being on a particular rotation. If you look at the macro trends of, oh, it seems like our database team’s cycling through a lot of folks, you can start to say, oh, maybe that’s a place we need to invest in. I particularly think we need to watch for signs in folks working in cross-team or cross-functional roles. The SREs who need to know the infra layer, the app layer, who need to know from the edge down to the container what’s going on, those folks have a really good pulse on what’s sustainable for them and what’s not sustainable for the org. Another pretty common sign is relying on the same folks over and over to handle incident response and fight those fires. It is really tempting when you’ve got somebody so seasoned and tenured that they can walk in, clean up a big mess in 15 minutes, and then go about their day. Of course, that seems like the best option in the short term, but you’re robbing other folks of the ability to train and learn and get to that level. I would look out for the same folks responding, whether that’s company wide in the case of an SRE, or even on your team if you’re always turning to your lead and expecting your lead to hop in. That’s a lot of responsibility for them, when we all share the pager at the end of the day. Another sign I’ve seen is what I call spammy alerts, when you’ve got a really low signal-to-noise ratio. Those are the alerts people put on perma-mute, and if you’re going to permanently mute an alert, I’d rather you just delete it so we don’t have to worry about it. The last piece would be a lack of training and enablement. That is a big red flag for unsustainability, because you don’t even need a whole written culture. I like to take the twelve-factor app, those 12 tenets, and for each one put a single slide, one Google Slides or PowerPoint slide per factor, that says: at this org, this is how we configure; this is what a deployment consists of. Something as simple as that can go a long way toward making people confident and comfortable in understanding their system. Those are the big signs: if you’ve got a lot of people cycling out, if you’ve got people who do not feel confident enough on call to handle things themselves after that probationary period and are always turning to the leads, so you’ve got that hero culture going, and if your alert channels are just lighting up and you ask, “Hey, is there an issue going on?” and someone says, “Nope, those are the normal alerts,” that’s a big sign that you’ve got to look at what you’ve got going on.

Kat Gaines: Yeah, I think those are all great notes, and I want to double down on the hero culture piece a little bit. I think it’s so easy to fall into the trap of saying, “Oh, Paige is really great at handling this type of issue,” or “Kat’s really great at this thing,” and then you keep going to that person every time that issue crops up, whether it’s formal on-call when there are incidents occurring, or just managing some type of issue and seeing it through. I think there can be a little bit of a misconception from leaders that, “Well, they’re great at this and so I want them to do it,” and so that is speaking to their ability, their intellect, their skill, and it should be a confidence boost that they’re really great at this. But you can praise someone for how good they are at a task without making them the only person who can do that task.

Paige Cruz: Yes.

Kat Gaines: It can be really exhausting to be the only person in a room who feels like they’re able to handle an issue, because as you said, it’s taking away opportunities from other people to grow and do those same types of things. I might be in a room and be the only person who knows how to do X, Y, Z task, but if we take the pressure off of me and give it to someone else and say, “Okay, go do this thing,” maybe they don’t know how yet, but they can become skilled in that too. That’s spreading the knowledge around. Honestly, it’s also insurance that you’re creating backups of people who can do things, right?

Paige Cruz: Yes.

Kat Gaines: You never want to be a single point of failure. It’s terrifying when that’s the case, and I think that it’s just too common for leaders to say, “Oh, well, this is me giving accolades to this person, and they must feel good about it,” when in reality they’re probably exhausted.

Paige Cruz: Yes. The way I like to explain it to leaders is if you’re familiar with the bus factor, what I like to say instead is the lottery factor. If this amazing superhero architect that you’ve got, that’s your key first responder won the lottery tomorrow, they won a zillion dollars and they decided to leave their post immediately, what impact would that have if that same day you had a big incident and you wanted to call them in? That sometimes helps frame the risk in a way that the business understands a bit more, but it’s an uphill battle because you’re totally right that when someone can solve one problem in five minutes versus a newbie who needs to go look at the docs and maybe spend 30 minutes responding, it’s sometimes hard in the moment to justify that. That’s why I’ll just say we’re starting a movement right now, sustainable operations, sustainable on-call, it’s about looking at that medium term, the long-term, who are the future operators on-call? Because the people who built the internet aren’t always going to be here to operate it, and we’ve got to pass that knowledge down.

Kat Gaines: Yeah, that’s a great point. If everything falls apart when somebody leaves their post, you don’t have sustainability and you weren’t doing the right thing in terms of how you manage that person’s skill and others around them.

Paige Cruz: Yes.

Kat Gaines: Talking about the business impact, let’s talk about that a little bit more. There can be a little bit of, I think, maybe a side eye from leadership around some of these things in terms of, especially when you get into talking about investing in the human side and really thinking about people and how they’re spread out and what that looks like. Again, they’re looking for efficiency. They’re looking for who can do it fast, how it can be done right, not taking risks, but let’s talk about maybe helping people listening to this make the case to their upper leadership. Why should organizations be pursuing sustainable operations? What does ROI look like? What are the risks if they don’t do anything at all, if they just stay where they are right now?

Paige Cruz: Yeah, totally. Why is a great question. Why should organizations even join this movement? There are a few reasons. When I look at it from the outside, if you’re selling a technical SaaS platform or a product or a tool, even if your users are not technical, think about Ring doorbells: the average person is becoming more and more tech fluent. My parents have said the words AWS to me, right? Like-

Kat Gaines: Oh my gosh.

Paige Cruz: Right, so they used to know amazon.com, I buy stuff there. Now they’re like, “What’s this AWS thing?” I’m like, “Where do I start? Oh, gosh.”

Kat Gaines: Where did you hear about that?

Paige Cruz: Yeah. Yeah. Where are you going on the internet? Let me check out your profile. We’re approaching a world where more and more users are starting to understand that the clouds are providing a lot of our computing power, and when cloud issues happen there are widespread outages, so they’re also starting to look for sites like “is it down for everyone or just me?” There are so many different signals now, even if your company isn’t the one to say, “Hey, we’re aware that there’s an outage, we know that it’s scoped to these features, and we’re going to update you when we know more and we’re working to fix it.” You can go and see, oh my gosh, there are so many reports in this region. Or I think about the Roblox outage from a couple years ago. Roblox, as I understand it, is sort of a platform for people to build games, kids can go on there, and there’s a whole creator economy. When they had an outage, there were actually other companies that scrape the analytics and data of the players to show who’s super popular. This totally third-party, external entity had graphs showing the Roblox outage and was giving updates from their vantage point, and you’re like, whoa. If I’m a leader of a business, I want to be the one in charge of that messaging. I want to have that proactive approach to say, “Hey, we know about this. We’re working on it.” I would not be super jazzed if my users got more information from a third-party downstream system than from me. I think that’s the first thing to be aware of for the business case: your users are going to find ways to understand what’s going on with your system whether or not you’re making this investment, and it’s about how you want to protect your brand identity. Honestly, if reliability is a key feature for you, you need to make good on that promise, and people will be watching.

Kat Gaines: Yeah, very much so, and like you’re saying, give your users more credit. They’re going to want to know what’s going on. They’re going to want more information. Your parents know about AWS, so people are going to know what’s going on. They’re going to be able to find things out for themselves. If you’re intentionally obscure with information, or, for example, I sometimes see people say, “Oh, well, we need to kind of ‘dumb down’ the technical detail that we’re putting into our status page posts,” no, you don’t. You actually need to give folks more detail than you think you do, because at the end of the day, they’re going to go hunt it down anyway. Also, if you’re not giving them the detail, guess what they’re going to do? They’re still going to ask for it. They’re going to hound your customer support team, for example. They’re going to hound their sales reps. They’re going to be poking at these people saying, “Where are these details?” Those people are going to be saying, “Well, we’re not the engineering team. We don’t know.” Then they’re going to come back to the engineering team saying, “Hey, we need these details.” So you’re saving everybody time at the end of the day by just being transparent, and also, this is something I never shut up about, by equipping your customer-facing teams with all of the detail that you possibly can, opening the door for them instead of hiding things behind a black box, so that they’re able to be good ambassadors of your brand and your organization to your customers. At the same time, you allow your engineers to do their jobs actually fixing the issue instead of constantly answering taps on the shoulder about what’s going on.

Paige Cruz: Absolutely. You’re spot on about that status page update. The real trick for me is finding what is just enough technical detail for your audience. I’ve been in a bubble of working for B2B developer tools specifically, and at the companies I’ve worked at, when we update the status page, engineers are reading that. They want to know a little bit more than “our database was on the fritz.” When I think about a company that does this really well, it’s ironically my power company, Portland General Electric. When we have an outage, they have a status page that shows the whole timeline of an incident, and it tells you where they are: “Hey, we’re aware there’s a problem. Hey, we’ve dispatched a crew.” Even just knowing that people are working on it, that’s one step. Then they’ll update us with, “Hey, it’s a tree. We found out it’s a tree, don’t worry. It’s a little bit localized.” Then they’ll say, “Okay, and now we’ve fixed it.” Also, because like you said, customers are hungry for that info, they’re going to ask, they’re going to clog your support queues, and rightfully so. They’re paying for a product. What PGE does that I love is they’ve got a live map that shows you, when you report an outage, how many other people around you are affected. Once they’ve declared, “Yes, what you’ve reported is associated with this incident,” they will say 2,000 customers affected, or whatever. If I saw only 10 people affected, I’d be like, oh, unlucky me in my neighborhood, but if I see 5,000, I’m like, oh, I’ve got to give them a little grace, because that’s a lot of people and they’re probably running all around our wires. That transparency means I do one thing, I say, “Hey, I have an outage,” and then I refresh the heck out of that live map. It’s not making them solve the problem faster, but it’s giving me, as that user, a little bit more assurance. Ironic, because no one loves their power company.

Kat Gaines: No one does. I was just thinking, gosh, that’s impressive for a utility company. It does show that they’re really doing the thing we’re talking about, which is they know their audience, they know what kind of information people are going to be looking for. They know what kind of power, as a user, you want to get out of that interaction and what you want to be able to do to contribute, I guess, in any way that you can. They’re able to size that response and size the information that they’re giving appropriately, which probably helps them save a bunch of time. I’m guessing, and hoping, that they have some templates for these things. I’m guessing it’s a tree more than once in a while, and that they’re able to really just take those things and say, “Okay, this is what’s happening,” and they have a playbook around what’s going on and what they need to do next. They have that information there, and they’re, again, thinking about the people involved, both their people and their audience.

Paige Cruz: Totally. Yeah, so that’s a big why: really the reason that any org exists, your customers, your users. That’s the first one I like to start with. The second one we’ve already touched on a bit, which is that seasoned incident responders do not grow on trees. They are born either from the hard knocks of life and that real-world experience or, if you are lucky, a very nice onboarding and an investment from a company in the growth it takes to go from zero to confident on-caller.

Kat Gaines: The idyllic example. Yeah.

Paige Cruz: Yeah, ‘cause for me, I honestly feared the pager for such a long time, and that fear prevented me from taking steps to empower myself to learn more about the system. I just thought, I’ll never know everything, or the alerts will tell me when things are wrong. We really need to cultivate not only a curiosity about production data but also open it up. Like you said, we could empower customer success with dashboards, and I think customer success at certain companies should totally have access to the monitoring tools, maybe in a read-only mode. Making that data available for people who are interested to plumb through and find things is really important. The sustainability aspect here is not only keeping the lights on for today and tomorrow, but next month, next quarter, next year.

Kat Gaines: Yeah, exactly. I mean, this is a little bit of a tangent, but if I can for just a moment: you’re talking now about empowering folks in customer-facing roles, and something that I saw when I used to lead support teams was that a lot of them had aspirations to move into engineering later on down the line, and so …

Paige Cruz: The best.

Kat Gaines: … [inaudible 00:20:50] it’s a great transition because you’re bringing all that rich, deep customer knowledge into an organization where their peers might not have had an opportunity to get exposed to that.

Paige Cruz: The warts and all. Support knows the warts and all of your product.

Kat Gaines: Exactly. They’ve seen the hairiest, grossest pieces of the product and the worst problems, and then they’re just itching to fix them. When you do that, when you expose information and when you expose dashboards and things to people, you not only help them in their current role in the moment, you’re helping them obviously, respond to customers and give them accurate information, but then you’re doing something else that’s really kind of magical. It’s just that you’re building relationships across your org in this beautiful way where everybody can see the same information and the same data, and everybody can talk about it in the same context because they’re seeing the same thing. Where you do have people who are aspiring to maybe change roles, to move into a different part of the org, to do some parallel transfers, things like that, then you’re building relationships. You’re helping them network, which is really hard to come by, especially these days, right? You’re helping them network with people in other parts of orgs in different roles, and you’re helping them then kind of get their name out there as someone who, when a role opens up, they might be thought of as, “Ooh, this is a possible internal candidate that could actually fill this role or grow into this role for us, and maybe we should consider that instead of, again, hiring externally, but start promoting more growth within the company too.” I don’t know. I think it’s a win-win for your customers, for your business, I guess win-win-win, ‘cause I’m saying three things, but for your teams and the people in them, right? Everybody’s benefiting a lot here from that transparency.

Paige Cruz: Totally, and to wrap up the answer with a bow: if I’m a CEO or a leader, why should I consider learning more about this, or pursuing this, or making it an initiative? It’s because user expectations have risen drastically, and the more tech fluent users get, the higher their expectations, not only of availability but of reliability and making sure everything works. You want to serve your users well because you want to keep them. The second reason you should do this is because the engineers you have today may not continue on with you. Maybe you do have a cycle time of two years per engineer. Well, how do you make sure that those two years are the most impactful they can be? How do you make sure, if that’s the average at your company, that you’re set up to onboard and bring people on call in a sustainable fashion, planning for that? The ROI is not that you’re erasing incidents; that is never going to happen, we are all imperfect humans building this stuff. The ROI is that you’re going to get more people who understand how to query data about the system, how to find things out, how to maybe tune their alerts. I’m not going to promise that your incidents are going to go from one hour to 30 minutes, because every incident is a unique beast, but it is the teamwork and collaboration that will speed up the response, and that could turn something from “oh, I made a config change and I wasn’t watching it deploy” to “oh, I know how my code goes from PR to production, and if there’s an issue, I can fix it before it becomes a problem.” The risk of doing nothing is letting what I call monitoring debt accrue, which is where you continue to develop your code and you continue to upgrade your systems (please upgrade your systems on a normal cadence), but you are not updating your alerts. As your system changes, they become invalid, inactive, or what I call spammy really quickly. There’s research in the medical field showing that each time a doctor responds to a false alarm, past a certain point, their ability to see it as a valid alert and react to it goes down by 30%. For each false alarm of the same alarm they get, the more they normalize, “Oh, when an alarm fires, it doesn’t actually mean there’s a problem,” which is not what we want our on-call responders to do.

Kat Gaines: Yeah. You don’t want to become so desensitized to those alarms that you just treat them like noise that’s just there, right? Like you said, we’re starting a movement here. It’s the sustainable on-call culture movement. It’s happening in this conversation.

Paige Cruz: Everyone’s invited right now.

Kat Gaines: Everyone’s invited. You’re all on board if you’re listening, you’re part of it. Congratulations. But who’s responsible for this? I guess I just said everyone’s part of it, but really, when we’re talking about an organization and seeding the change, is it your SRE team? Is it your DevOps engineers? Is it your CTO? What do we think?

Paige Cruz: Yeah, organizational change. I think that is what SRE should just be called: organizational change engineer. It’s tricky because you need all levels to be bought in. They do not all need to be totally on board, wearing the sustainable on-call T-shirt, but they need to understand that the status quo isn’t working for folks. Whether you do a formal survey, whether you gather anecdotes, or whether you’re a manager who says, “Look, I’ve lost the third engineer this quarter, and it’s really hard to hire for this role,” something’s got to give at the point that anyone at any level says, “This is not sustainable.” Whatever role you’re in, if you’re at the SRE or individual contributor level, the first thing you need to do is find an executive sponsor. You cannot pursue sustainable operations on your own. The business needs to understand the value of what you’re doing. If you’re an engineer who is totally tired, deep in alert fatigue, raise that flag to your manager: “Here’s how I’m feeling. This is why I’m feeling that way.” Even if you don’t have the answers, ask, “What could we do about this?” If you’re an IC, find that manager, find that director, find the CTO, find that person who’s a business leader, make your case, and see what understanding you can come to and figure out what is reasonable. Because when you’re in the pit of alert fatigue, you feel like there’s no way out, like the problem is everywhere, I get a thousand alerts every rotation or whatever, and it can feel too big to bite off. When you raise that flag up, the business leaders can say, “Whoa, we don’t care about every environment,” or, say you’re an e-commerce site, “let’s just make sure we focus on the checkout flow or the edge team,” or whatever it is. That’s for the IC: find your business leader. If you’re a business leader, oh my gosh, and you’re trying to seed this change, hats off to you. We need more of you. Please go clone yourself.

Kat Gaines: Absolutely. Please. Yeah.

Paige Cruz: In that case, if you are maybe a director of SRE and you’re like, “Hey, if I’m responsible for reliability and now sustainable operations, how do I make sure, with my powers, that I can do that?” That’s when you want to look at the data. Once you’re coming at it from the business side, go to your pager stats, go to your incident stats. I hope you’re keeping these historically available and queryable. See from a business lens what the impact of the current status quo is. Then the important thing for business leaders to do is go talk to the people carrying the pager. Match those hard stats with the anecdata and lived experiences of the people holding the pager, because you may think, “Oh, team Z is really having a problem. We should start there,” when actually you’re about to lose somebody on team Y because they’re fed up, even though it’s been a simmering fire, not a whole forest fire. What is the real answer to all of this, and to DevOps and SRE? It’s just: please talk to people in your org, people who don’t sit on your team and don’t have your same role. That’s where you start.

Kat Gaines: Yep, exactly. I agree 100%. We were talking a few minutes ago about sharing tooling and sharing visibility across both your teams and your leadership, knowing what type of visibility everybody needs, right? I think we want to be careful about how we talk about this, because you’re not going to buy a tool and have it magically solve all of your sustainable on-call culture problems. There is no silver bullet here. There is nothing that’s going to just fix it in a snap, but tooling is something that plays a pretty big role in making sure that the humans behind everything are supported and have what they need and have that visibility. How can tooling really support efforts to improve our on-call culture? Is there anything that you want to call out in terms of tools, products, or features that people should be looking at and checking out?

Paige Cruz: Yeah, absolutely. That distinction is so key for me. Tools are not the answer, but tools are a part of the answer. It’s very trite, but I like to let computers do what they do best, which is crunching a lot of numbers really quickly and making comparisons, and I like to let humans do what we do best. We’re great at creative, novel problem solving, and we’re really good with charts, with visual depictions of data and interpreting them quickly. When I think about tooling that helps, for observability it’s distributed tracing. If you are working within a microservices architecture where you’ve got hundreds of tiny services all running around and it’s hard to figure out what’s going on when somebody logs in, try to introduce some distributed tracing. What that helps with is it tells the story of one request throughout your system. While it is totally helpful for production on-call, I actually think it supports the humans way before you even pick up the pager. I will walk into an organization with tracing, and I use that to onboard myself to the system and say, “Oh, this talks to this,” or, “Oh my God, why do we have two proxies in here? What’s going on with nginx?” Traces can create living architecture diagrams, and I know most engineers are loath to write documentation, let alone draw a really complicated architecture diagram themselves, because it goes stale so fast. I think tracing provides a lot of different ways for you to enable the human in the loop as well. I will plug the open source project OpenTelemetry, which has gotten all of big monitoring to the table. It’s the one industry standard for tracing, and everybody’s invited to the OTel party, so that’s a big one. The other two, as far as observability goes, and observability is really about how your engineers can understand the system, are tours of dashboards and descriptive alert titles. I love a good note in a dashboard that says, this is why this is here, this is what this chart means, X, Y, Z, but a little 30-second video walkthrough from your lead, “this is how I look at the Elasticsearch dashboard,” could be worth its weight in gold ten times over, because it’s putting it in those human terms. The last bit for observability specifically is descriptive alert titles. Oh my goodness gracious. I’m most familiar with PagerDuty, and if I hear the robot read out another “the average alert for the metric volume for the response time on the P95 on the service X in this region,” I’m like, I don’t even know what’s wrong. I don’t know if I need to-

Kat Gaines: How can you tell from that? Yeah.

Paige Cruz: Am I ripping off the covers and running downstairs, or am I like, “Oh, I’ve got 10 minutes. I can slap my face a bit, get a cup of coffee and be good?” Please, please, please. Alerts are for humans. Dashboards are for humans. The computers don’t look at those, right? Make them useful for the humans.

Kat Gaines: Name your alerts, name your services, all of it in a way that a human being is going to understand at three in the morning when they don’t know where they are. Yeah.

Paige Cruz: Yeah, and I think about a new person joining an org that loves to use fun names for services, like food at a buffet. I’m like, I don’t know what Penne Pasta is supposed to call out to. Oh, dear God, please give them descriptive names. I have worked in a system like that, half fun names and half descriptive names, and it was, oh, it was interesting.

Kat Gaines: I bet.
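To make the distributed tracing suggestion from this exchange a bit more concrete, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK. The service name, span names, and attributes are invented for illustration, and a real deployment would export spans to a collector or tracing backend rather than the console.

```python
# A minimal sketch of manual instrumentation with the OpenTelemetry Python SDK
# (pip install opentelemetry-sdk). Names here are illustrative only; a real setup
# would export to a collector or tracing backend instead of the console.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that tags every span with a service name.
provider = TracerProvider(resource=Resource.create({"service.name": "login-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def handle_login(user_id: str) -> None:
    # The parent span tells the story of one request through this service.
    with tracer.start_as_current_span("handle_login") as span:
        span.set_attribute("user.id", user_id)
        # Child spans show which downstream pieces were involved and how long each took.
        with tracer.start_as_current_span("lookup_session"):
            pass  # e.g. call the session store
        with tracer.start_as_current_span("render_response"):
            pass  # e.g. build the response


if __name__ == "__main__":
    handle_login("user-123")
```

Even with console output, the nested spans start to sketch the “this talks to this” picture described above, and swapping the exporter later doesn’t change the instrumentation.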

Paige Cruz: As far as your alerting system goes, I’m most familiar with PagerDuty, so that’s where my examples come from, but your alerting system has, or should have, really valuable, rich data about who’s involved in these incidents, what alerts are going off the most, who’s getting interrupted at 5:00 AM and 7:00 PM, who’s been on call the most. I think the most shocking stat I ever pulled out of PagerDuty was for an org where I sorted everybody by who had the most on-call time, and there was someone who was perma on-call. I look at that data and I scream. I’m like, that’s not sustainable. In a nutshell, 24/7 on-call is not sustainable for one human. Then I go talk to that person and they’re like, “It wasn’t that big of a deal, that service is quiet.” I’m like, “For you. When you leave, do you think it’s fair to pass on that kind of rotation to the next person and hire them into that? Heck no.” So really take a look, with a lens on the human involvement, at the human stats for the things that go off in your alerting. PagerDuty in particular has got some cool analytics. I don’t know if you want to share a little bit about any quick reports folks can pull or things like that.

Kat Gaines: Yeah, absolutely. I think there’s a lot that people can pull. I would honestly recommend that you go poke around in the product if you’re a PagerDuty user, or if you’re someone who’s interested in spinning up a trial. Poke around in the analytics a little, and maybe look in our knowledge base too, so you can understand a bit about the different types of reporting and what’s there. For example, you can see a lot in the reporting around people who are on-call, like Paige was saying.

Paige Cruz: Those heroes.

Kat Gaines: Those heroes, the people who are on call too much. It might be fine for them in the moment, but again, the moment they leave, it’s going to be chaos for the next person coming into that. You can get an understanding of who might be burnt out soon as a result, who might be quietly suffering and not saying anything. You can also get some information if you look at, for example, our intelligent dashboards, around your business objectives, your team services, and what some of the repeat offenders are in terms of where problems are cropping up, and really drill into those details. I really want to encourage people to go check that stuff out, spend some time in it, play with the built-in reports, and play with some of the stuff in PD Labs too, which is really fun and exciting, and just see how much data you can get out of PagerDuty if your team is in there. You do have to have your services and alerts named in a human-readable way so you understand what each one is referring to; you have to have good hygiene in your PagerDuty services for this data to be useful. But I’m going to cross my fingers that everyone listening either does, or will go clean it up right now while they’re listening.
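As one illustration of the “who has carried the pager the most” report Paige mentioned, here is a rough sketch against the public PagerDuty REST API /oncalls endpoint. The date window, token environment variable, and the heuristic of flagging entries with no end time as permanently on call are assumptions for the example; the built-in analytics and reports Kat describes surface similar information without writing any code.

```python
# A rough sketch: tally on-call hours per person from PagerDuty's REST API /oncalls
# endpoint. The date window, env var name, and the "no end time = permanently on call"
# heuristic are assumptions for illustration; pagination beyond the first page is omitted.
import os
from collections import Counter
from datetime import datetime

import requests

API_TOKEN = os.environ["PAGERDUTY_API_TOKEN"]  # a read-only REST API key is enough


def parse_ts(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
    params={"since": "2023-01-01T00:00:00Z", "until": "2023-03-31T00:00:00Z", "limit": 100},
)
resp.raise_for_status()

hours_by_person = Counter()
for entry in resp.json()["oncalls"]:
    start, end = entry.get("start"), entry.get("end")
    if not start or not end:
        # Entries without a start/end are typically permanent escalation targets.
        print(f"{entry['user']['summary']} appears to be permanently on call")
        continue
    hours_by_person[entry["user"]["summary"]] += (
        parse_ts(end) - parse_ts(start)
    ).total_seconds() / 3600

for name, hours in hours_by_person.most_common(5):
    print(f"{name}: {hours:.0f} on-call hours in the window")
```

Sorting the totals is a quick way to spot the perma-on-call pattern Paige describes before it becomes the next person’s inheritance.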

Paige Cruz: Again, PagerDuty can’t solve it for you. That’s part of the shared responsibility. You’ve got to put in a little bit of work.

Kat Gaines: It really can’t, as much as I would love to say, oh, we can do everything. No, we cannot solve it and do everyone’s jobs for them, and so make sure your service hygiene is good, and then go dive into the reports and spend some time just mining all of that juicy data out of there so you can understand where you need to work on sustainability on your team specifically.

Paige Cruz: Totally, totally.

Kat Gaines: We’ll link some of those things too. Folks, check out the resources in the episode. We’re going to link a couple of other features that are pretty cool in terms of understanding what’s going on with your services. We’ll talk about dynamic service maps, business services, probable origin, and change events. Those are all things that Paige and I talked about before recording the episode as interesting features that people might want to check out. I’m going to plug our status page too. That’s a more newly released feature, and it ties back to what we were saying earlier about getting folks on board across the organization and empowering your customer-facing teams; it’s going to be a really powerful tool for doing that and for communicating out to your customers. Lots of stuff to check out in the resources section as well.

Paige Cruz: Yeah, you’ll level up after reading through all of that, and the one that I would highlight the most, if you’ve been a PagerDuty user for a long time and you are like, what the heck is a business service? Ooh, start there. That is something that you can use as a bridge between you and the more business oriented folks. It’s how you can have two people in totally different roles look at the same tool and get the same value out of it.

Kat Gaines: Yeah, it’s really powerful stuff to be able to do that. I think that we’ve talked about a lot of different kind of calls to action that we have for folks here. If we were to kind of boil it down to the tactical next steps, if someone is listening and they’re like, “Yes, I’m joining the movement. I’m creating sustainable on-call culture at my company,” what are we telling them to do? What is step 1, 2, 3, maybe?

Paige Cruz: Yeah, step 1, 2, 3. First, get a baseline of the status quo right now. You’ve got to know where you are to know where you’re going. That could mean spinning up something like an on-call log or on-call diary, where your primary on-call keeps a record of any time they respond to an alert or go to tune something. Basically, it’s a little report of how they spent their on-call time. That is such a rich source of information for people onboarding, to know, “Oh, this incident happened before, and I saw that Joe Bob did this and it solved it. I’m going to try that.” If you don’t already have a way to get that baseline, start a practice of the on-call log. It can be very, very bare bones to start, and it’s something that will evolve within your particular on-call culture. Another thing you might want to consider starting or refining is your handoff practice. Are you just doing it in Slack? Are you doing it on a video call? What is the expectation from the person going off call to the person going on call? You should have some conversation about, “This is what I’ve noticed. Here’s something you should pay attention to,” or, “Hey, don’t worry, I’m going to pick up that thread when I come back online next week.” That handoff is a critical time for two humans to talk and share info. Either tighten that up or, if you’re finding your handoffs aren’t useful, think about ways you could make them better. We don’t have to keep doing things the way they were if they’re not working; that’s your permission to change it. The last piece of getting that baseline is to ask people what their on-call experience is like, starting with the newest engineers, who are going to have the freshest eyes and probably be the most terrified about picking up the pager, but really at all levels, not just the junior engineers; a senior architect isn’t going to walk into an org and know everything two months in, either. Talk to them about how their on-call is going. So that’s step one, get a baseline. Step two, once you’ve got that baseline, share, share, share that data. Share your interpretations of the data, and ask other roles and functions how they would interpret it and how their roles are impacted by having 10 incidents in one month compared to five. Then, step three, once you’ve got knowledge of the current status and you’ve got folks who are interested and invested in making some sort of change, you get to the fun part where you just iterate. Every time someone’s on call, see what improvement they can make, share, rinse, and repeat. It’ll look different for every org and every system, but in a nutshell, it’s really about getting humans talking to each other and not letting computers ruin our lives.
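If it helps to picture what a bare-bones on-call log could capture, here is one possible shape, sketched as a small Python helper that appends timestamped entries to a shared Markdown file. The fields, file name, and example entry are assumptions for illustration; a wiki page or shared doc works just as well, and teams will evolve their own format.

```python
# A bare-bones sketch of an on-call log: append a timestamped entry for each alert
# you respond to or tuning change you make. Fields and the file name are assumptions;
# a wiki page or shared doc works too, and teams usually evolve their own format.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("oncall-log.md")


@dataclass
class OnCallEntry:
    alert: str            # descriptive alert or incident title
    action_taken: str     # what you did: acknowledged, tuned, escalated, fixed
    follow_up: str = ""   # anything the next on-call should pick up at handoff
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat(timespec="minutes")
    )

    def to_markdown(self) -> str:
        lines = [f"## {self.timestamp} {self.alert}", f"- Action: {self.action_taken}"]
        if self.follow_up:
            lines.append(f"- Follow-up for next on-call: {self.follow_up}")
        return "\n".join(lines) + "\n\n"


def log_entry(entry: OnCallEntry) -> None:
    # Appending keeps a running diary the next on-call (and new folks) can read back.
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(entry.to_markdown())


if __name__ == "__main__":
    log_entry(OnCallEntry(
        alert="Checkout latency p95 above 2s in us-east",
        action_taken="Rolled back the morning config change; latency recovered in 10 minutes",
        follow_up="Add a pre-deploy check so the next rotation doesn't hit this",
    ))
```

The exact shape matters less than the habit: the entries become both the baseline data and the handoff notes described above.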

Kat Gaines: Yeah. Oh my gosh. It’s that simple, folks. Just don’t let computers ruin your lives, right?

Paige Cruz: 1, 2, 3.

Kat Gaines: Exactly. One, two, and three. Okay. We’re getting to the end of our conversation here, but Paige, before I let you go, there are two things we ask every guest on this show. The first thing I want to know is: what is one thing you wish you would’ve known sooner when it comes to running software in production?

Paige Cruz: For me, it would be not to be afraid of production, because the engineers I admire the most, the most badass ones where I’m like, “Oh my God, you know everything,” learned it on the job just like I did. That is part of the journey. You don’t get to be a seasoned on-call engineer without spending time learning how systems break, and they will break, and that is nothing to be afraid of. It’s an exciting opportunity to learn more about your system. Stressful sometimes, but a good opportunity.

Kat Gaines: Completely. Then our second one: is there anything about running software in production that you’re glad we did not ask you about?

Paige Cruz: Such a good question. Yes. What immediately came to mind is how to tune the JVM. It is just not in my wheelhouse, and I’ve heard horror stories. You just really need to have worked with it for a while. You can ask me anything you want about on-call, except for that.

Kat Gaines: This is not the place. Yes.

Paige Cruz: Yeah.

Kat Gaines: We talked about a couple of PagerDuty resources we want folks to check out, but then Paige, is there anything else that we’re going to link in the show notes that you want people to have a look at?

Paige Cruz: Yeah, if you’d like some more tactical advice on how to go about starting this culture for yourself and you don’t know where to gather that data, I’ve linked to a post I did for SysAdvent called Assembling Your Year in Review. It gives you the tools and the questions to ask to look back at the last 12 months of your org, your on-call, and production, and to gather the data to facilitate that conversation. The second one I’m going to throw in there is a book called Sustainable Web Design. If you listened to this podcast and were like, “Hey, I care about the earth, and I thought we were going to talk about that,” this book is what really started turning my gears in that direction. It is from the publisher A Book Apart, and it talks about all of the different optimizations we can do, from front end to back end, to make sure that we’re being good stewards of our resources and not emitting a bunch of CO2 because we didn’t compress our files or something. Those two should get you started. And if you really do need a resource for burnout, if you are that engineer who says, “I’m in the pit and I cannot climb out alone, and I need to get my own baseline for myself,” I will include a link to a quiz from Yerbo that will help you score your own level of burnout. You can keep that private, or if it is super telling, you may share it with a manager, and that may be something that spurs some action.

Kat Gaines: Yep. Perfect. Okay. Well, Paige, thank you so much for this conversation. I hope our listeners learned a lot and I had a really great time chatting with you.

Paige Cruz: Me too. Thank you so much, everybody, for listening, and thank you to PagerDuty for having me and for having a really great name that I could borrow for my own handles online.

Kat Gaines: Fantastic. Yeah, we love some PagerDuty around here. All right, folks, thanks again. Again, this is Kat Gaines. We’re wishing you an uneventful day, and do go check out those show notes. They’re going to be pretty thick for this episode. Have a good one, folks. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe in your favorite podcast app if you like what you’ve heard. You can find our show notes at PageIttotheLimit.com, and you can reach us on Twitter @PageIt2theLimit using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.

Guests

Paige Cruz

Paige Cruz (she/her)

Paige Cruz is a Senior Developer Advocate at Chronosphere, passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to SRE, holding the pager for InVision, Lightstep, and Weedmaps. Off the clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.

Hosts

Kat Gaines

Kat Gaines (she/her/hers)

Kat is a developer advocate at PagerDuty. She enjoys talking and thinking about incident response, customer support, and automating the creation of a delightful end-user and employee experience. She previously ran Global Customer Support at PagerDuty, and as a result it’s hard to get her to stop talking about the potential career paths for tech support professionals. In her spare time, Kat is a mediocre plant parent and a slightly less mediocre pet parent to two rabbits, Lupin and Ginny.