Working With SLOs With Alex Hidalgo

Posted on Tuesday, Apr 5, 2022
Service Level Objectives (SLOs) are a method for focusing work on reliability. As a tool for your team, SLOs provide insight into service performance and can act as a framework for prioritizing tasks and features.

Transcript

Mandi Walls: Welcome to Page it to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting this system. I’m your host, Mandi Walls. Find me @lnxchk on Twitter. Hey, here we are. Welcome back to Page it to the Limit. This week I have with me Alex Hidalgo. Alex is kind of a guy you might know from Twitter, but also is a principal reliability advocate at Nobl9, and the author of, Implementing Service Level Objectives from O’Reilly. Welcome to the show, Alex.

Alex Hidalgo: Thanks for having me, Mandi. I don’t know what it quite means that the first thing you go to is a guy you might know from Twitter, but.

Mandi Walls: Everybody seems to know you from Twitter, right? The sort of DevOps environment on Twitter is very closely knit.

Alex Hidalgo: That’s true. No, it’s a fun group there. Actually, I found during the pandemic, it’s really also been a good set of friends to have, honestly.

Mandi Walls: Yeah, absolutely. I’ve completely given up on Facebook because that’s very toxic. But Twitter friends are awesome. Yeah. All right. Well, get us started. You’ve been working on SLO stuff, that’s what Nobl9 does. Give us the baseline, like what are SLOs? And why should folks be thinking about this? What are they going to do with them?

Alex Hidalgo: Sure. So at their absolute basic, SLOs are service level objectives. They’re just a way of admitting to yourself that nothing’s ever perfect. That’s really at the absolute root. It’s just a codification of the concept of, don’t let great be the enemy of the good. Nothing’s ever a hundred percent, nothing ever succeeds all the time, failures happen. And SLO-based approaches say, let’s embrace this and let’s pick a reasonable target instead. Since we know we can’t hit a hundred percent, we admit that at some point in time, something’s going to fail about our computer services. We’re going to have an incident, a dependency’s going to break on us, whatever it might be. And so instead let’s pick a more reasonable target. And that’s what an SLO really is. It’s picking a percentage, a reasonable percentage for how often you need to be operating reliably to hopefully have a good balance between what your users, what your customers need, versus what keeps your teams and your finances healthy.

Mandi Walls: For those of us at a certain age, I guess, there was this dream of the five nines. And like, this feels like an evolution, but also like an acceptance of reality that like we’re not running landline telephones anymore.

Alex Hidalgo: Yeah, totally. I mean, five nines, what that means for everyone that doesn’t know, it means you’re aiming for 99.999%. Whether that’s an availability target, or you only want one out of a hundred thousand requests to have an error, whatever it might be. But I always love translating these into time.

Mandi Walls: Oh, yeah.

Alex Hidalgo: Because that kind of helps people kind of really wrap their head around what this means. And five nines or 99.999%, if we were to turn that into an availability target, would be approximately six seconds of potential downtime per week. Or about five minutes per year. So when you really start to think about it that way, is that a reasonable target? Can you actually run computer services that actually hit that target in any kind of meaningful way for your users? And even if you can hit that, do your users even need that? Do they even require that? Or are these just kind of meaningless, endless nines you’re stacking on top of your target that really no one’s expecting of you anyway?
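Editor's note: the translation from "nines" into time that Alex describes is simple arithmetic. The sketch below is only an illustration, not something from the episode; the function name and constants are made up for the example, and it assumes a 7-day week and a 365-day year.

```python
# Convert an availability target into the downtime it permits over a period.
# Assumptions (not from the episode): 7-day week, 365-day year.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60      # 604,800 seconds
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # 31,536,000 seconds


def allowed_downtime_seconds(target_percent: float, period_seconds: int) -> float:
    """Seconds of downtime an availability target permits over a period."""
    return period_seconds * (1 - target_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        week = allowed_downtime_seconds(target, SECONDS_PER_WEEK)
        year = allowed_downtime_seconds(target, SECONDS_PER_YEAR)
        print(f"{target}%: {week:.1f} s per week, {year / 60:.1f} min per year")
```

For 99.999% this works out to roughly six seconds per week and a little over five minutes per year, which matches the figures quoted above.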

Mandi Walls: Yeah. I feel like, in the place where like, yeah, we have a lot more technology now and it’s like part of all of our life, like people have become a lot more … I won’t say necessarily forgiving. But maybe accepting that occasionally there’s going to be a blip, that there’s going to be a time when things don’t respond as fast as possible. And maybe things are a little bit unexpected. And that seems very refreshing versus being screamed at for every.

Alex Hidalgo: Yeah. Humans are used to failure. I have a million computer examples, but I also like the example, I worked in the service industry for a long time. And I think one of the reasons why SLOs speak to me so much is because I realized like we did all this at restaurants.

Mandi Walls: Okay.

Alex Hidalgo: Like whether I was a server on the floor or I was a cook in the kitchen, we knew that we weren’t going to be perfect. And that was fine, as long as we were good, often enough. And the example I like to give is you’re in a restaurant and you order a pizza. And you ordered a pepperoni pizza, but instead cheese pizza comes out. That’s a failure. It didn’t come out exactly how you requested it. This is like sending a request to an API and you’re getting a slightly wrong response back. But it still came out quickly, so the latency wasn’t bad. And it still came out, so your request wasn’t lost entirely. And now you have a choice. You can either accept this minor failure and just eat the cheese pizza, which you also still like. You wanted pepperoni on it, but that’s fine, you also like cheese pizza. Or you can just retry, literally like retrying a request to an API. You can send it back to the kitchen and they can add pepperoni to it and send it right back out. And you’re fine with that. You’re not going to like never go back to that restaurant again, just because they made this one tiny little mistake. And our computer services are just like that. Once you introduce humans to it, they’re fine.

Mandi Walls: All bets are off if people are weird.

Alex Hidalgo: Yeah. And you just got to make sure that you’re balancing things well enough that you’re not losing your users. You don’t want to be terribly unreliable. But trying to ensure you’re sending out the exact correct pizza every single time, it’s just not going to happen.

Mandi Walls: So with that, like what you mentioned there, how does your team know what users find most important? Are you doing experiments? Are you doing some AB testing? Are you doing surveys? Like what kinds of tools are there for folks who want to find their most optimal objectives there?

Alex Hidalgo: Yeah. This is actually the most difficult part. It’s very simple to sit here and talk about, well, like don’t aim for a hundred percent. But then, well what should we aim for? I don’t know. That’s actually very difficult to figure out.

Mandi Walls: Yeah.

Alex Hidalgo: I hate always going straight towards, it depends, but it depends. It depends on what your service is, it depends on what your users expect, it depends on what kind of business you are. Are you charging people or not? People are a lot more flexible dealing with failures for something that they know is free versus something that they’re paying a premium for. Even just like a streaming service, you often have different tiers available. Your users who are paying you a premium for the top tier service are going to have much stricter requirements than those paying you for the lower tier service. So, there’s a ton that goes into that. And how do you figure that out? The important part is just to be thoughtful, I think. Like I don’t have super great advice that is entirely applicable to every single person across the board besides be thoughtful. Think about what is our situation? What do our users look like? Can we send them a survey? Are they even going to respond to that survey? Maybe we need to do interviews. And if we do interviews, maybe we need to compensate our users for that. Because an interview can be a lengthy session. So maybe we got to give them some kind of gift card or a free month. It really is pretty wide open, just be thoughtful and make sure you are thinking about that. Because too many people go down this road and adopt this kind of approach and be like, “Okay, cool. We’re doing SLOs now. So we’re going to aim for 99.999%, latency below 400 milliseconds.” What does any of that mean without context? It means nothing.

Mandi Walls: Yeah. And it’s interesting too, like in some ways people kind of ask like either they want to be told what the right objectives are to set. Or they want the freedom to kind of go crazy and experiment and find that. It’s hard to find that like nice middle place where, okay, these are the things that fit for us in this team and this application. And there’s no sort of right way to do it and helps sort of lead people in that right direction.

Alex Hidalgo: Yeah. A thing that I often have to remind people of is that, an SLO based approach to reliability is that. That’s one of the reasons I use that term. It’s an approach, it’s a different way of thinking about your services. It’s a better way to have some pre done math, that takes some of the telemetry you have, hopefully makes it a little bit more understandable. And you then can use that data to help you make better decisions. But there’s nothing about this that is like, it’s not a one and done thing. It’s not like you set SLOs and now we go do something else. No, you set SLOs because you want to measure your service in a different way. And if it turns out a week later or a month later or a year later, that what you picked to measure things, wasn’t the right thing. Then you go change that. They’re not agreements, they’re not SLAs.

Mandi Walls: Yes.

Alex Hidalgo: Like people have heard of service level agreements, these things written into contracts that people are beholden to. And that’s not what an SLO is. It’s explicitly not an agreement. So go change them if needed. They’re constantly evolving, they’re constantly iterative. So, you don’t have to worry about whether or not what you picked the first time around was perfect or not. As long as what you’re picking isn’t so terrible that you’re driving users away or so terrible that your business is going under, you’re not going to get the right target the first time around. You might even say there is no such thing as a perfect target, because reality is constantly changing and that’s all fine. The point is, are we taking the time to think about things in the right way?

Mandi Walls: Yeah. I think that’s super interesting too. Like people do kind of get mixed up with like the SLA, the service level agreement. There’s lawyers involved, there might be fiduciary responsibilities there, you publish those on your website and that it’s a very public contract. Your SLOs are the things that your team is promising almost to yourselves. Like that’s the goals that you’re working towards with your team. And maybe you have like some responsibility to your dependent services and things like that, but it’s all internal.

Alex Hidalgo: Yeah, exactly. I don’t really like talking SLAs because like they’re not my thing. Because as you just mentioned-

Mandi Walls: We’re not lawyers.

Alex Hidalgo: Yeah like they’re about lawyers. A lot of people like don’t, I think, realize SLAs exist so that you, as a customer, have an excuse to break your contract.

Mandi Walls: Mm-hmm (affirmative).

Alex Hidalgo: That’s what they’re really there for. No one really cares about the minor amount of credit a major cloud provider gives you if they exceed their … no. Like no one actually cares about that “financial benefit”. No, it’s there so that you can break your contract with a vendor if they suck too much. And that’s all an SLA actually is. Now turns out we can take kind of the concepts behind an SLA, which is not aiming for a hundred percent. And use it for good via SLOs, but they’re really very different to my mind.

Mandi Walls: Right. You’ve got different stakeholders. When you’re talking SLOs and when a team is looking to sort of engage in that process, who do you include in that discussion? Who all are included in your stakeholders? You’ve got obviously your users that are kind of maybe your silent partner, if you’re not talking to them directly. But like who else internally is part of that discussion or should be?

Alex Hidalgo: It can be as broad as possible. At the end of the day, SLOs are a better communication tool. Like they’re better data to help you make decisions, but also communicate out what those decisions might be. And therefore, really anyone who may consider themselves a stakeholder in any way. So it could be teams that depend on your service, it could be the product and project managers that care about your service, all the way up to the business side of things and the C level. Anyone who may have investment, even if they don’t realize it, should be able to know what your SLOs are. I’m a big fan of them being highly public and [inaudible 00:11:37] transparency see there. But, instead of the huge version that I just laid out, you can also just say, it’s just for your own team to make decisions, for your own team to decide, okay, what should we focus on next sprint? Maybe it’s for a closely aligned sister team who relies on your service. So it doesn’t have to be the kumbaya version where the entire company is bought in and everyone cares about it. It could also just be like a small scale thing. Again, it depends entirely on your situation and who cares about things. What I do think is important to know is that it’s always your user.

Mandi Walls: Mm-hmm (affirmative).

Alex Hidalgo: It’s always at least you and your user, at least.

Mandi Walls: Okay.

Alex Hidalgo: And we often say user, but what that really means is anything that depends on your service. So that doesn’t have to be a human, it doesn’t have to be a paying customer. It might be a team down the hall, it might be a third party application. Anything that talks to your service is what you’ve got to think of as the user. So while you’re picking your targets or you’re picking your measurements, make sure you at least have them in mind.

Mandi Walls: Yeah. Especially for folks who run backend services, like their choices of SLO will impact how rigorous anyone who depends on them can make their SLOs. So like having that discussion between teams would be necessary almost. Like the front end team can’t set a higher SLO than I can provide them from a dependent service and something like that.

Alex Hidalgo: Yeah. I mean, luckily it turns out that people like this approach in general and places where I’ve helped introduce it and places where I’ve seen it introduced, it generally grows. And you have people kind of come to this realization themselves that, “Oh, I’m depending on this service and they’re only promising that many nines, so I better pick this other target.” Yeah. Like that’s the cool thing is people generally pick up on this themselves. But also in a perfect world, like, yeah, you start from the bottom up, you say, “Okay, here’s my infrastructure. Like here are the targets that we believe we can hit and that we can reasonably aim for.” And I even talk about it in an absolutely perfect world that would include the people running our hardware first. Or what’s the reliability of the circuits delivering electricity to the racks that your servers reside in? What are the reliability targets of the cooling units in the data center? That might be mostly a hypothetical thought experiment more than it’s realistic. But that’s the way you got to think about things. Everything depends on something else at some level. There really isn’t anything that we can’t jump one further level down from, in terms of abstractions. So you can’t go all the way to the bottom, but make sure you’re thinking about that enough steps back. Even if you are like, “Okay, cool. I’m the infrastructure team. I provide Kubernetes running on bare metal for my entire company.” Well, who’s taking care of that bare metal? Who’s taking care of the power delivery to that bare metal? Make sure you’re thinking about those kind of things. Because unless you understand, or have some, at least, vague understanding of what the reliability of those dependencies might be, it can be really difficult for you to promise anything.
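Editor's note: one rough way to see why a service cannot promise more than the things beneath it is to multiply the availabilities of everything it strictly depends on. The sketch below is a simplification under stated assumptions: every dependency is required (no redundancy or fallback), failures are independent, and the layer names and numbers are hypothetical rather than anything quoted in the conversation.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Approximate ceiling on availability when every dependency must be up.

    Assumes hard, serial dependencies with independent failures.
    """
    return prod(availabilities)


# Hypothetical stack, bottom to top. None of these figures come from the episode.
stack = {
    "power_delivery": 0.9999,
    "bare_metal": 0.9995,
    "kubernetes": 0.999,
    "backend_api": 0.999,
}

ceiling = serial_availability(stack.values())
print(f"Approximate ceiling for anything built on this stack: {ceiling:.4%}")
```

Under these assumptions the ceiling is lower than the weakest single layer, which is why a front end target stricter than its backend's is hard to defend. Real systems with retries, caching, or redundancy can do better than this simple product suggests.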

Mandi Walls: Yeah, definitely. And data center design definitely gets into all that wacky business. I haven’t stepped in a data center, I think since we worked together a million years ago, which I’m not sure I’ve missed it, honestly. But yeah.

Alex Hidalgo: Every once in a while, I have the itch to be really chilly for the day and un-rack some things. Yeah, like every once in a while.

Mandi Walls: Try really hard not to drop them. And yeah.

Alex Hidalgo: I still have a tool box about two rooms away that I rescued from the New York City data center.

Mandi Walls: Oh man.

Alex Hidalgo: Yeah. Like I might be the last person on the planet with like an actual physical remnant of Admeld in that sense-

Mandi Walls: Probably.

Alex Hidalgo: Which for the audience, is the company Mandi and I used to work at together a million years ago.

Mandi Walls: A million years ago. I mean, I still have my crimper and I still have my wire cutters and my little star screwdrivers and that stuff. But like everything else, I’m just like, don’t want to do this anymore. But yeah, definitely like thinking about like what your providers are going to give you. Oh my gosh. Those are dark days, man. I don’t know. So along with SLOs, like the other part, the sort of mirror image of that is what folks call error budgets. And the pieces that are like your flexibility, I guess, where your errors go, that you budget for your errors. Tell me a little bit about how that works and how that like … what folks use those for. You mentioned a little bit about setting priority for work and stuff like that. How does what’s sort of left over after you’ve picked your SLO? How does that help you out?

Alex Hidalgo: Yeah. So as you just alluded to, like an error budget is the opposite. So if you have an SLO target, real easy number 99%. You want to be 99% reliable, whether or not that’s availability, error rates, the data correctness, whatever. You have this target and your service wants to hit that target. And you have decided that it’s fine if you aren’t reliable that 1% of the time. That’s what you’re also saying. If you’re saying 99% is good enough, you’re also implicitly saying 1% bad is cool with us. And once you’ve established that, you now can say, “Well, we have this 1%.” Like this is now our budget. This is now an amount of time or a count of events that can go wrong, that we don’t care about, really. We don’t panic until we exceed what this budget gives us. So now you have this period of time or again, number of events, that you can do all sorts of things with. Either one, you just use it to absorb the natural failures that occur with your service. And that’s what I think most people and most SLOs kind of exist for.
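Editor's note: the 99% target and 1% budget Alex describes can be written down as a small calculation over event counts. This is a minimal sketch, assuming an event-based SLO; the class name, field names, and sample numbers are hypothetical and do not reflect any particular product's data model.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Event-based error budget for one SLO window (illustrative only)."""

    target: float       # SLO target as a fraction, e.g. 0.99 for 99%
    total_events: int   # all events observed in the window
    bad_events: int     # events that did not meet the objective

    @property
    def allowed_bad_events(self) -> float:
        # With a 99% target, 1% of events are allowed to be bad.
        return self.total_events * (1 - self.target)

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0 or below means it is spent.
        allowed = self.allowed_bad_events
        if allowed == 0:
            return 1.0 if self.bad_events == 0 else 0.0
        return 1 - self.bad_events / allowed


# Hypothetical window: one million requests, 2,500 of them bad.
budget = ErrorBudget(target=0.99, total_events=1_000_000, bad_events=2_500)
print(f"Allowed bad events: {budget.allowed_bad_events:.0f}")
print(f"Budget remaining:   {budget.remaining_fraction:.0%}")
```

A time-based SLO works the same way, with minutes of unreliability in place of bad events.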

Mandi Walls: Okay.

Alex Hidalgo: But you can also say, “Okay, cool. We’ve been running exceptionally well for some amount of time. So we have all this budget remaining because we’ve been running at approximately a hundred percent and we were really only aiming for 99%.” And that now means maybe we can do some experimentation. Maybe this is a good time for chaos engineering. It’s even down to the tiny things. Like, have you ever wondered what would happen to your service if you switched the garbage collection method like on your JVM? Like I don’t know, go find out.

Mandi Walls: It could catch fire, it could be great. Yeah, let’s do it.

Alex Hidalgo: Right. Exactly. And if you have error budget, it’s a great time to say, “Cool, let me just flip that flag and find out because you know what? Everything’s been fine.” It’s a signal. It tells you whether or not now is the right time to do things. And on the inverse, a great story before I was at Nobl9, I was at Squarespace for a while. And they spent a few months planning a data center black hole exercise.

Mandi Walls: Oh.

Alex Hidalgo: They were going to turn one of the data centers entirely off. They hadn’t actually done this before. And it was a big deal, coordination across teams, everywhere, across org. Super huge effort. And just like the week before like there’s an outage and it wasn’t anyone’s fault. It was a DDoS, the kind of thing that happens to a website company like Squarespace every once in a while. And so I had to go to the organizers of this, like they were having their last meeting to have their black hole exercise. I’m like, “We can’t. We don’t have error budget. We just had an outage last week. We can’t right now.” And I remember the person organizing this, “Dammit, Alex. You’re right.” Because it can also be a signal that tells you the exact opposite, let’s not experiment right now. Let’s not have this black hole exercise, let’s not find out what happens when. That’s a really fun thing and I think it’s very important to find out what happens when, for various aspects of your services. But you do it at the right time. And an error budget either tells you, “Yes, we have error budget, let’s find out right now.” Or, “No, we’re out of error budget. Let’s wait a bit because we have users who depend on us.”
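Editor's note: the "yes, let's find out" versus "no, let's wait" signal can be expressed as a very small check. The policy below is purely hypothetical, including the idea of estimating an experiment's cost up front and holding back a reserve; Alex does not describe a specific rule, and a real team would choose its own.

```python
def should_run_experiment(
    remaining_fraction: float,
    estimated_cost_fraction: float,
    reserve_fraction: float = 0.2,
) -> bool:
    """Decide whether a risky exercise fits in the remaining error budget.

    remaining_fraction:      share of the error budget still unspent (1.0 = all of it)
    estimated_cost_fraction: share of the budget the exercise might consume (a guess)
    reserve_fraction:        share kept back for ordinary failures (hypothetical policy)
    """
    return remaining_fraction - estimated_cost_fraction >= reserve_fraction


# Plenty of budget left: go ahead with the exercise.
print(should_run_experiment(remaining_fraction=0.9, estimated_cost_fraction=0.3))

# Budget mostly spent after a recent outage: postpone.
print(should_run_experiment(remaining_fraction=0.1, estimated_cost_fraction=0.3))
```

The point of the sketch is only that the budget turns "is now a good time?" into a question with a defensible answer; the thresholds themselves are a judgment call.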

Mandi Walls: Mm-hmm (affirmative). No, that’s super interesting. I wasn’t even thinking about it from that particular perspective. But yeah. Bring everything that can induce instability kind of to a more controlled piece [crosstalk 00:19:49]-

Alex Hidalgo: The stereotypical example that’s in the Google SRE books is: have error budget, ship features; out of error budget, stop shipping features, focus on reliability.

Mandi Walls: Yeah.

Alex Hidalgo: It’s overly simplistic, I don’t love it because it doesn’t give you enough room. And I think it incorrectly kind of pigeonholes reliability work as not being project work, which I don’t think is correct at all. Like that is feature work.

Mandi Walls: Yeah.

Alex Hidalgo: So, I don’t love that example for that reason because it’s too simplistic. But a better way to think about it is minimize things that may bring instability or periods of unreliability to your service if you’re out of budget. Because that budget is telling you, maybe we’ve sustained enough instability recently.

Mandi Walls: Cool. And with those, does the timeframe also … when you’re setting SLOs and you’re sort of picking your timeframes, you’re looking across maybe days or weeks or whatever. Is that sort of the same practice as you’re picking the SLO, like picking that timeframe a bit of experimentation or just figuring things out? Or when do you reset?

Alex Hidalgo: Yeah. Some people do rolling windows and some people do calendar-aligned ones. So some people maybe they’re defending an SLA, so they do have this SLA. So they set a slightly more stringent SLO and they need that tied to a calendar month because our contracts are always-

Mandi Walls: Okay.

Alex Hidalgo: Calendar based. Literally the month of March, the month of April. But I like rolling windows better. Pick 28 days, pick 30 days-

Mandi Walls: Okay.

Alex Hidalgo: And literally have that as a rolling window, moving forward into time. And allow bad events or bad time to drop out, like off of the back. And so you kind of recover your budget as enough time has passed. I like about month periods just because they seem to work well, humans are used to measuring things in months. So I don’t think there’s necessarily some strong mathematical reason to pick it outside of the fact that people are good with it. A lot of people pick maybe two weeks or 28 days because that might line up with multiple sprints. But it’s like so many other things, just make sure that what you’re picking, you’re at least being partially thoughtful about. Does it line up for you? Does it work for you, your teams, your business, your organization? But you do want to give yourself enough time to reasonably measure things, too. I don’t see a ton of benefit in a single day budget, for example. Maybe we can come up with some niche, hypothetical situations. But do give yourself enough time that you can meaningfully look back and say, how have things been over this time period? Maybe even think about it in the sense of, has enough time passed? Is this window large enough that if something goes really wrong, that once it drops out, are our users no longer mad at us? That might actually be a reasonable [crosstalk 00:22:30]-
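Editor's note: the rolling window Alex prefers, where bad events eventually "drop out off of the back", can be sketched with daily counts. This is an illustration under assumptions: the function name, the daily aggregation, and the sample data are all hypothetical, and the 28-day default is just one of the window lengths mentioned above.

```python
from datetime import date, timedelta
from typing import Mapping, Tuple


def rolling_good_ratio(
    daily_counts: Mapping[date, Tuple[int, int]],
    as_of: date,
    window_days: int = 28,
) -> float:
    """Ratio of good to total events over the window ending on `as_of`.

    daily_counts maps each day to (good_events, total_events).
    Days older than the window are ignored, so old incidents drop out.
    """
    window_start = as_of - timedelta(days=window_days - 1)
    good = total = 0
    for day, (day_good, day_total) in daily_counts.items():
        if window_start <= day <= as_of:
            good += day_good
            total += day_total
    return good / total if total else 1.0


# Hypothetical data: 40 days at 1,000 events per day, with one bad day early on.
first_day = date(2022, 3, 1)
counts = {first_day + timedelta(days=i): (1000, 1000) for i in range(40)}
incident_day = first_day + timedelta(days=5)
counts[incident_day] = (800, 1000)  # 200 bad events on the incident day

inside = rolling_good_ratio(counts, as_of=first_day + timedelta(days=30))
outside = rolling_good_ratio(counts, as_of=first_day + timedelta(days=35))
print(f"Incident still inside the window: {inside:.4%}")
print(f"Incident has dropped out:         {outside:.4%}")
```

A calendar-aligned window would instead reset the counts at the start of each month, which is the trade-off discussed for teams defending a contractual SLA.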

Mandi Walls: User memory could be part of your window. Yeah. Cool. So, Nobl9, like this is part of what you guys do.

Alex Hidalgo: Mm-hmm (affirmative).

Mandi Walls: How are you sort of bringing this to your customers? What’s your practice like that you’re helping folks with there, with the product line?

Alex Hidalgo: Yeah. So Nobl9, we aim to be able to basically ingest data from anywhere, do SLO and error budget math against it, and then give you the best data we possibly can out of that. Whether or not it’s automated alerts or talking to your CI/CD system to either halt or release releases, or literally paging people, if things are really bad.

Mandi Walls: Mm-hmm (affirmative).

Alex Hidalgo: We’re just there to kind of help you do that math better because-

Mandi Walls: Oh, okay.

Alex Hidalgo: Some platforms may do it in a rudimentary sense. Most platforms don’t do it at all. And companies keep having to build their own internal SLO tooling over and over and over again. And we’re really there to ensure people don’t have to do that anymore. So we kind of aim to help you take data from anywhere, whatever might become a useful SLO. Because it’s not just your time series monitoring data, that’s where people often start. But you don’t have to limit yourself, just like the cheese pizza example we had earlier. The data that you may want to set reliability targets around, it may surprise you where that lives. There’s a lot of great business data I like that you can use for this. Again, it doesn’t just have to be your operational telemetry. And we just aim to help you better delve into that data by doing the math for you.

Mandi Walls: Cool. That seems super helpful out there for folks. And we’ll put a link to the Nobl9 website in the show notes for folks who aren’t familiar with the company and can check that out. One maybe spicy question I have for you is like with this sort of more sophisticated process, does this help folks get away from things like code freezes and the don’t release on Friday practices? And some of those other constraints that have been sort of plastered on in places to protect the reliability of systems?

Alex Hidalgo: I think it can. Ultimately those kind of practices are all about the culture that has been built at an organization. And I don’t want to say that just introducing SLOs will instantly make the fear of a Friday release go away. Or the practice of having code freezes during holidays go away. But SLOs do give you more reasonable reliability data to look at-

Mandi Walls: Mm-hmm (affirmative).

Alex Hidalgo: Because like error counts don’t mean anything unless you understand what amount of errors can be tolerated. Like until you have a target, until you have some understanding of what can our failures look like? What can our unavailability look like? How many incidents can we sustain as a business? How many outages can our users handle? Until you have an understanding of what any of those things are, you’re never going to resolve the systemic sociotechnical issues that underlie, don’t deploy on Fridays. That underlie, we must have a code freeze. Like there’s not necessarily a one to one like relationship. But I do think embracing an SLO based culture is a step towards embracing a culture that understands it’s actually okay to release whenever you want. That maybe code freezes are not always great. I’m actually more of a fan of code freezes than people may think, actually-

Mandi Walls: Yeah.

Alex Hidalgo: I think for certain businesses, it makes total sense.

Mandi Walls: Okay.

Alex Hidalgo: If you’re like a retailer, why not freeze for a week before Black Friday?

Mandi Walls: Yeah. Take it off.

Alex Hidalgo: But on the same token, like you should release whenever is … here’s a better way to put it. I’m getting slightly tangential.

Mandi Walls: Okay.

Alex Hidalgo: It’s got nothing to do with releasing on Fridays or not. It’s got to do with, do you feel safe releasing when you need to?

Mandi Walls: Yeah.

Alex Hidalgo: And you know what? That also means if you don’t want to release for a week before Black Friday, don’t, as far as I’m concerned. Feeling okay about when you want to release includes when you don’t want to release, as well as when you do want to release. Like that’s my take on it, I think.

Mandi Walls: And I totally agree. And I think some folks have been using … code freeze is not necessarily … maybe not malicious. But like in an irresponsible way. Where they know it’s coming, the calendar exists and you’ve planned it ahead and you know what those holidays are. But then someone wakes up in the middle of October, all surprised. And it’s like, “Oh wait, we wanted to release this thing. And it turns out it hits in the middle of code freeze week. We’re going to have to rush and get this done.” No, you need to plan better or have a slightly different model for what you’re doing.

Alex Hidalgo: Right. Exactly. Have as many code freezes as you want, as long as you’re also shipping at the right times. Yeah. Anyway.

Mandi Walls: Yeah. Interesting dysfunctions out there with some of those folks. One question we like to ask, is there a myth that you often find yourself debunking about SLOs that you can share with us?

Alex Hidalgo: I mean, a ton, there’s a lot of, I think, just general misconceptions. People who think it’s actually good to aim for more nines, people who think that the goal is to have as many nines as possible. People who seem to think you can only use the number nine, which is also confusing to me, because there’s nine other numbers beyond nine that you can be using for your reliability targets. Like maybe 98.7% is what you should actually be aiming for, who knows? Your service is unique, you tell me. But I think the most pervasive is that people think of it as an OKR, a quarterly goal, a thing-

Mandi Walls: Yeah, okay.

Alex Hidalgo: We do once. We now have SLOs, yay. No, it’s a totally different way of thinking. It’s a different way of measuring. It does not end. It is closer to something like using agile to plan your sprints than it is just a checkbox. It’s a different kind of data, you need to be essentially perpetually using to help you make decisions. The best ways I’ve seen, like people love talking about how budget-burn-based alerting is superior to threshold alerting. And I think it can be, if done right. But the most useful way I’ve seen SLOs used over my many years focused on this, is teams that just look at them once a week. They have their weekly sync and part of their meeting is they go look at all their error budgets for all their services. And they say, “Huh, do you see that one? It’s kind of going down. Maybe someone should look at that.” It’s that, that is really the most useful, beneficial way of using SLOs, I think. They give you a million things, but above all, they give you a signal, maybe we need to look into that. That only happens if it’s a new thing that you add perpetually to your signals, that you add perpetually to your planning. It’s not just a thing you check off a list.
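Editor's note: the burn-based alerting mentioned in passing is commonly built on a "burn rate", the observed error rate divided by the error rate the SLO allows. The sketch below shows that common formulation only; it is an assumption that this is what Alex has in mind, and the function name and sample numbers are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window if this pace continued; higher values exhaust it sooner.
    """
    if not 0 < target < 1:
        raise ValueError("target must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - target
    return observed_error_rate / allowed_error_rate


# Hypothetical last hour: 50 bad requests out of 1,000 against a 99% target.
rate = burn_rate(bad_events=50, total_events=1000, target=0.99)
print(f"Burning budget at about {rate:.1f}x the sustainable pace")
```

A threshold alert would fire on the raw error rate alone; a burn-rate alert asks how fast the budget is disappearing relative to the target. The weekly review Alex recommends needs neither, only a look at whether each service's remaining budget is trending down.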

Mandi Walls: Awesome. And is there anything else that you want to mention that we haven’t covered yet? We’re almost at the end of our time. So any parting thoughts?

Alex Hidalgo: Just people should register and check out SLO Conf happening next month, May 9th through 12th. Come check it out, registration’s free. It’s entirely virtual, speaker list should be out by the time you’re listening to this. We should have dozens and dozens of great talks. They’re all just five to 10 minutes long. And they’re explicitly meant to be consumable while you work, tiny little snack size tidbits of SLO wisdom that you can kind of absorb in between meetings. So come check us out at sloconf.com.

Mandi Walls: Absolutely. We will put a link to that in the show notes for folks to check that out. And you guys record those and they’re posted and available afterwards, as well, so you don’t want to miss out on all that good stuff. If you’re interested in SLOs and if you’re not ready, come back to them when you are, they’ll be there and that’ll be great. Well, Alex, thank you for joining us this week.

Alex Hidalgo: Yeah. Thanks for having me, Mandi. I had an absolute blast.

Mandi Walls: Yes. Bit of a walk down memory lane. Makes you feel a little old on a morning, but it’s fine. It’s all good. So like I mentioned, we’ll have links to Nobl9 and to Alex’s book and to SLO Conf in the show notes out there, make sure you sign up for that. Upcoming from PagerDuty, as well, should be out by the time this gets released. We’ll have our PagerDuty Summit announced for June of this year. And we hope to see everybody at those, as well. So in the meantime, we’ll be wishing you an uneventful day. That does it for another installment of Page it to the Limit. We’d like to thank our sponsor, PagerDuty for making this podcast possible. Remember to subscribe to this podcast, if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two. Thank you so much for joining us and remember, uneventful days are beautiful days.

Show Notes

Additional Resources

Guests

Alex Hidalgo (he/him)

Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of Implementing Service Level Objectives. During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex’s previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Hosts

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.