SRE Right Answers/Wrong Answers With Dueling Brians

Posted on Tuesday, Jul 7, 2020

In this episode, Julie Gunderson plays Right Answers/Wrong Answers with Brian Weber, Sr. Site Reliability Engineer at Twitter, and Brian Rutkin, Staff Site Reliability Engineer at Twitter. The guests share both the right ways and the wrong ways of doing things.

Transcript

Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson. @Julie_Gund on Twitter. Today, we’re going to talk about SRE and all the things on our special episode of Right Answers/Wrong Answers with Dueling Brians at Twitter. We’re joined today by Brian Weber @mistermocha. That’s M-I-S-T-E-R-M-O-C-H-A. Senior SRE at Twitter and Brian Rutkin, Staff SRE @arocknerd at Twitter as well. Thank you both for being here. To get us started, we’re going to do Right Answers/Wrong Answers, and then talk about why people feel certain ways about SRE and what it is. Brian Weber, do you want to go ahead and start us off with what is SRE?

Brian Weber: Absolutely, I’d love to. And thanks again for having us. So, I often find a lot of people struggle to try and answer, “What is SRE?” I keep thinking about the answer that I give to people like my parents who have very little idea of what it is that I do. And a lot of what I try to say is, “I’m kind of like the OSHA compliance officer for my engineering team.” They’re building the machinery, they’re building the factory, they’re doing all this stuff, but I’m the one who’s like, “It’s great that we can produce 500 cars in an hour, but there’s a pile of oily rags over there, and a dangling cable over by it. You probably want to pick some of that up.” Just because we’re fast, it doesn’t mean we’re going to be able to maintain this speed like that other factory. I kind of like try to draw that kind of analogy. I look for things that are outliers in efficiency that are health problems, and not so much like product health, but like developmental health. And generally try to impose those quote unquote standards on my team to say, “Hey, let’s try and do things this way.” There may not be an official standard, but unofficially, most of our SREs do things this way and maybe it’s better supported to do it this way. So, I find myself talking to other SREs both in the company and out of the company to try and get an idea of what those kinds of things look like. So, I often talk to Brian Rutkin, who’s also on this podcast, and he generally doesn’t tell me too many lies about the problems, about what we’re doing wrong. With that, I hand off.

Brian Rutkin: Thank you, Julie, for having us. And I’d say the number one wrong answer about SRE is, “SRE is operations.” There’s a lot of, I think, rebranding that has gone on in the last couple of years. And saying that SRE is a cool title to have, and therefore someone who does operations is suddenly SRE, and that’s not the case. Because SRE is similar, SRE is not dev ops. And that’s sort of going towards the other extreme that SRE is a subset, but it’s not quite the same. And so, taking your engineers that are dev ops and suddenly rebranding them as SREs is not necessarily the right thing, either. Those are kind of the two spectrums, and SRE kind of falls toward the middle of those in terms of how you would use them.

Julie Gunderson: Brian Rutkin, I have heard that before. And as a matter of fact, just last week, that dev ops is a concept and SRE is the implementation of dev ops. You want to talk a little bit more about that?

Brian Rutkin: Sure. I think that that’s true in some respects. SRE is an implementation of dev ops. And dev ops is the concept, roughly, that your operational work can be solved by writing software. And that’s true in many aspects of running software systems, but not a hundred percent. You’re still going to need actual people. And so, SRE is sort of the understanding that operational work is required, and the goal should be to remove absolutely as much of it as you possibly can by a human.

Julie Gunderson: Thank you. Brian Weber, you want to come back on that?

Brian Weber: Sure. So, there is a lot of misconception about just how much software development somebody in an SRE title does. And the right answer to that is, “It varies.” Now, I never studied computer science in school or college. I’m probably 80% self-taught, and I say 80% because I’ve taken a few classes here and there. I’ve also learned from a bunch of other peers and colleagues about how to do things. And the reality is I, nowadays, write very little code compared to how much other project work that I do. And a lot of it is getting back to what Brian Rutkin was saying about applied implementation. So, sometimes I am going in and patching prod code, but sometimes I’m writing a design doc on how to actually assemble a bunch of parts together. And sometimes I’m doing a bunch of research to figure out which one of our thousand config files I need to change out. So, I really don’t exactly consider myself a software developer, although I certainly spend a fair amount of time writing code. The real question that a lot of people ask is, “Just how much software do I need to know in order to do SRE?” And the thing is you just have to know the right amount, which varies. So, I’ve heard the title also, that SRE is just a software developer with a different focus, and I’m still unsure about how much of that I agree with.

Julie Gunderson: I love it. And I kind of wonder how long it would take of a podcast to really thoroughly answer that question. Right? There’s so much on it. Now, Rutkin, I want to talk to you a little bit about SLOs and SLAs. We did an episode in the past with Yuri Grinstein from Google on SLOs and SLAs. Do you want to go ahead and talk to us a little bit about the right usage of these, and how you set these metrics?

Brian Rutkin: Sure. And this is really going to vary for every organization and every service, at least by a little bit. I think that most people would agree that you want to focus on a very few number of SLOs to drive and accomplish your SLAs. So, for people who are not familiar with the topic, an SLA is a service level agreement. And usually that is the agreement that a service owner has with their customers. And that could be an external client, or it could be an internal customer that calls your service and maybe there’s no one external to your company that’s using it, it applies the same way. And then, the SLO is the service level objective. And that is an internal metric that you set for yourself and your service as to how well it’s going to perform. Oftentimes, the SLOs are also broadcast out to customers. But you generally, and almost always, but maybe there’s an exception out there that I can’t think of, want your SLO to be a stricter target than your SLA. So, that way, when you’re in violation of your SLOs, the thing you’re trying to do, it doesn’t mean that you are violating the agreement that you have with your customers. So, setting those is about knowing your service. And in general, there’s three things that we tend to look at from the service side. We want to look at the success rate, we want to look at our latency, and we want to look at our accuracy. So just succeeding in returning a call is not the same as returning the correct answer, and those can be calculated differently. So, those are often the three things that we care about. If you are lucky enough to have access to your client information, to be able to measure those things from the client side is even better because that’s who you care about, it’s serving your customer. Your agreement says what you are going to do, and your objective says, “I’m going to try and do better than that all the time, so I never violate what I’ve agreed to with my customer.”
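
To make the relationship Rutkin describes a little more concrete, here is a minimal Python sketch that computes the three service-side measurements he names (success rate, latency, accuracy) and checks a success rate against an SLO that is stricter than the SLA. The `Request` record, the nearest-rank percentile, and the 99.9% and 99.5% targets are illustrative assumptions, not figures from the episode or from Twitter.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    succeeded: bool    # the call returned without an error
    correct: bool      # the response was also the right answer
    latency_ms: float


def success_rate(requests):
    """Fraction of calls that returned without an error."""
    return sum(r.succeeded for r in requests) / len(requests)


def accuracy(requests):
    """Fraction of successful calls that returned the correct answer."""
    ok = [r for r in requests if r.succeeded]
    return sum(r.correct for r in ok) / len(ok) if ok else 0.0


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency, pct in (0, 100]."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical targets: the internal objective is stricter than the agreement.
SLO_SUCCESS = 0.999
SLA_SUCCESS = 0.995


def evaluate(rate, slo=SLO_SUCCESS, sla=SLA_SUCCESS):
    """Classify a measured success rate against the SLO and the SLA."""
    if rate < sla:
        return "sla_violated"   # the customer agreement is broken
    if rate < slo:
        return "slo_missed"     # early warning, agreement still intact
    return "ok"
```

The gap between the two thresholds is the point of the design: a result of `slo_missed` is the signal to investigate while the agreement with the customer is still being met.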

Julie Gunderson: Thank you. And Weber, you’re up for wrong answers on SLOs and SLAs.

Brian Weber: Oh, and I got a few of those, Julie. At a previous place of employment, which I’ll leave casually anonymous, I worked with a couple of SREs where, when I asked them the question about, “Who do you think our customers are?” The response was, “We don’t have customers. What are you talking about?” The fact is, no matter what you do, if you’re managing any sort of product, whether it be Twitter, or an API to an app, or your operating system to your developers, or so on and so on and so on, you have customers. And if you’re further down the stack, your customer base gets kind of exponentially larger. People often get misconceptions, I think, about what a customer is and what it should be. The other misconception I see very often is what people should be paying attention to with respect to an SLA. Your SLA is an amount of uptime and amount of reliability. Has nothing to do with how much you’re garbage collecting, what your internal error rates are, it just has everything to do with what is your end state matter? So, a lot of people that I’ve also worked with, other teams, other products, they get crazy about monitoring absolutely every little thing and alerting on absolutely every little thing. And I’ve had some very noisy on call rotations because I’ve had contentions with teams who can’t either understand, agree on, or focus on what a specific SLA is. If you know what your SLA is, and you can focus on that, then all of your alerts come strictly off of, not just your SLAs, but your SLOs, which as you mentioned earlier, your SLO should be slightly stricter than your SLA, so that way you can catch them before you break contract. I really don’t care how much my service is garbage collecting unless it’s causing a problem. And I’ve seen plenty of wild garbage collection and wild error rates that our customers never noticed and never care about. And if they don’t care, I don’t want to care until it’s approaching a problem. Because I like sleep, as I’m sure the rest of us do.

Julie Gunderson: That was great, Brian. And one of the things that you kept talking about was alerts. Can you talk to us a little bit, and let’s do a right answer on this, on tuning those alerts and what the role of SRE is within that?

Brian Weber: Sure. When I’m creating a new alert, I’ll initially create it noisy. And I’ll initially create it noisy, maybe even just to email. So that way, I get a generally good idea on what the right tuning is before I set that alert to start waking me up in the middle of the night. I’ve come in to on-call rotations before where the previous week somebody deployed a new feature or a new service, and either they didn’t tell me, or I totally overlooked the fact that they deployed some change. And then I start getting these alerts for some mystery product that I know nothing about. So, I’ve had to go back and have conversations with people about, “Hey, this is great that we have this new product, let’s try and figure out how to make the noise level appropriate.” I spent a fair amount of time during my on call rotations, attempting to adjust and tune alerts. Because again, nobody wants to get woken up at night unless it’s a real problem. Even then, nobody wants to get woken up at night. Tuning alerts ends up being an ongoing process. That’s why it’s tuning alerts, not setting alerts. We do have hard and fast rules at Twitter for service level objectives. We have a fixed number of ideal success rate that internal services should be achieving. Everything else, ideally, is up to service owners to come up with their own decisions on what is appropriate. Which is, again, why SLOs end up becoming a slightly different thing than SLAs. So, tuning alerts ends up, in my opinion, having a lot lean on what your objectives are for your service.
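
One way to get that “generally good idea” before an alert is allowed to page anyone is to replay candidate settings against recorded data and count how often each would have fired. The Python sketch below is an assumption about how that might look, not a description of Twitter’s tooling; `count_firings`, the sample values, and the thresholds are all hypothetical.

```python
def count_firings(samples, threshold, for_samples):
    """Count how many times an alert would have fired over recorded samples.

    The alert fires once each time the value stays above `threshold`
    for `for_samples` consecutive samples, and can fire again only after
    the value has dropped back to or below the threshold.
    """
    firings = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == for_samples:
                firings += 1
        else:
            streak = 0
    return firings


# Hypothetical error counts per minute from a week of email-only alerting.
history = [1, 5, 5, 1, 5, 5, 5, 5, 1, 5]

# Compare how noisy each candidate setting would have been.
for duration in (1, 2, 3):
    print(duration, count_firings(history, threshold=4, for_samples=duration))
```

Requiring the condition to hold for more consecutive samples trades detection speed for fewer firings, which is the kind of adjustment that can be tried while the alert still only sends email.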

Julie Gunderson: Thank you. Brian Rutkin, what have you seen go wrong here? What are some of the anti-patterns you’ve seen with tuning or not tuning those alerts?

Brian Rutkin: Yeah, certainly. There are definitely extremes that I think teams sort of go to when they don’t quite understand what their objective for alerting is. And that is a slightly different objective than the SLO objective. So, you can go the extreme of, “I never want to be woken up at night.” And you always tune your alerts so that they never fire. Or you can go the opposite way is, “I want to know about everything that happens on my service and I want it to be alerted by each of these things.” And obviously, both of those are incorrect. You definitely want to be able to catch problems with your service, but you also should be running a service that doesn’t need to tell you about every little thing that’s going on. I think, especially when we talk about SLOs versus alerts, that there’s a good way to think about them is that SLOs are the thing that tell you that you are approaching a point at which you need to take notice. You need to figure out why your service is behaving a certain way, and make adjustments. And sometimes that can be due to growth, or it can be due to a new feature, or sometimes it can be due to other things like degradation in service that you need to take care of. Alerts should be used for basically two purposes. Immediate problems. So, something like, you’re having a catastrophic failure, you should know about that right away. Your SLOs should catch that as well, but maybe you have a catastrophic failure on a node in a many node service, and you should still take care of that quickly, but hopefully it doesn’t violate your SLO if you’ve done things properly. And then, the other thing is to watch trends over time. Because you could have increasing memory pressure over time, but maybe you have things in place that remediate that and take care of it for you automatically. But that doesn’t mean your service is necessarily running correctly. So, if you can watch these things over time and say, “Why are we doing this threshold? What did we change? Or what things are different about our service?” Then the alerts will grab you there.
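
The two purposes Rutkin names, immediate problems and trends over time, can be sketched as two separate checks on the same metric. The Python below is only an illustration under stated assumptions: the least-squares slope, the `classify` labels, and the memory figures are hypothetical and are not drawn from the episode.

```python
def slope_per_sample(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify(values, hard_limit, slope_limit):
    """Return "page", "review", or "ok" for a non-empty series of samples.

    "page": the latest sample is at or over the hard limit (immediate problem).
    "review": nothing is on fire, but the metric is climbing faster than
              slope_limit per sample (a trend worth asking questions about).
    """
    if values[-1] >= hard_limit:
        return "page"
    if slope_per_sample(values) > slope_limit:
        return "review"
    return "ok"


# Hypothetical memory usage, in percent, sampled once an hour.
print(classify([50, 51, 52, 53], hard_limit=90, slope_limit=0.5))
print(classify([50, 50, 50, 95], hard_limit=90, slope_limit=0.5))
```

Keeping the trend check separate from the hard limit means a slow climb can prompt the “what did we change?” conversation without waking anyone up.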

Brian Weber: I’d love to add something to that as well, if I may. One thing that I’ve seen in other positions and such, as well, is when alerts are not necessarily your responsibility. Meaning, you’re getting paged because somebody else is doing something that affects their service, but somehow ripples up to you. Now, when you’re in an interconnected platform, you got a stack of microservices, there’s some of that happening. But there’s also events where it’s like, long ago, I worked at a company where all of the alerts came to my team. Literally all of them. This was initially a decision that was made for having one team to pay attention to on call load, and we had shifted rotations, so it wasn’t like all night, every night. But it was for our X hours of the day that we were on call. And then afterwards, you’re off shift, you’re fine, do whatever. The on-call burden should ultimately be on the person who’s responsible for the fix. So, I’ve received alerts for other teams’ soon to expire TLS certificates, as an example. And it affected our team, but we can’t do the rotation. So, ideally, that other team should be getting those alerts. And so, I’ve had to work with some of those teams to encourage them to make that shift and own those alerts. I’ve done this with previous teams as well, where we have consulted with service owners who were sending their alerts to our central team to say, “Hey, this is great that you’re monitoring this, but you need to own this. Let’s help you on this. And also while we’re at it, let’s help you engage some better practices, so maybe these alerts don’t fire to begin with or get so noisy.”

Brian Rutkin: Yeah, I think that you touched on a good point with the things that people often do wrong. Alerts should be actionable. If there’s nothing that you can do, you probably shouldn’t be alerting on it.

Brian Weber: And that’s one of those things that I could never get a clear definition on what that meant, actionable. Because at first I thought, “Oh, you write a description of how you respond to that alert in your documentation.” But if you can have a quantifiable response, you should be able to code out a response. You should be able to have some system in place that can respond to that. And that’s where that little bit of dev ops implementation comes in about writing alert reactions and things like that on whatever automation your platform has in place, or whatever deploy system you have. Are you Kubernetes, or whatever other containers that are out there? Can you have health checks that just kill a container and restart it appropriately? But actionable does mean just what Brian had said. There has to be some required response. And eventually, out of that, then you can get into automatic responses if you have them.
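
As a rough illustration of coding out a response, the Python sketch below restarts a service after repeated failed health checks and pages a human only once automatic restarts are used up. Everything here is an assumption made for the example: `supervise` and its `check`, `restart`, and `page` callables are hypothetical stand-ins, not the API of Kubernetes or of any real deploy system.

```python
def supervise(check, restart, page, max_failures=3, max_restarts=2, rounds=10):
    """Run up to `rounds` health checks and respond to failures automatically.

    check()   -> bool  True when the service is healthy (hypothetical callable)
    restart() -> None  automated remediation (hypothetical callable)
    page(msg) -> None  escalate to a human (hypothetical callable)

    Returns "paged" if a human had to be involved, otherwise "ok".
    """
    consecutive_failures = 0
    restarts = 0
    for _ in range(rounds):
        if check():
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures < max_failures:
            continue  # tolerate brief blips before acting
        consecutive_failures = 0
        if restarts < max_restarts:
            restart()
            restarts += 1
        else:
            page("automatic restarts exhausted; needs a human")
            return "paged"
    return "ok"


# Usage with stand-in callables that only record what happened.
events = []
result = supervise(
    check=lambda: False,
    restart=lambda: events.append("restart"),
    page=lambda msg: events.append("page"),
    max_failures=2,
    max_restarts=1,
)
print(result, events)
```

The ordering reflects the idea in the conversation: the quantifiable response runs first, and a person is alerted only when there is still something left that requires them.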

Julie Gunderson: I love that. And when we talk about tuning alerts and actionable alerts, and then moving into the incident phase. Especially when you have those incidents that trigger, that shouldn’t have triggered, or the incidents that triggered that that should have. How do you handle post-mortems and sharing those learnings throughout the organization? Rutkin, I’m going to give you the right answer on that.

Brian Rutkin: Well, I wish there was a right answer. Certainly we do a lot of, I think, good practices here at Twitter. Especially in the last, I’d say even year, been a lot of focus on how to run postmortems and how to be more effective in terms of your reliability. I think the number one thing is figuring out how to communicate with your organization. Which can be hard, especially the bigger the organization gets. How do you share learning? And there are a number of things that we try and do at Twitter that I think are good, and we are continuing to try and figure out more things and better things to do. One of the things is to actually have a post-mortem for incidents. We do have a threshold for what constitutes an incident big enough to require a post-mortem, and everyone should probably have that defined. What are your severity levels in terms of incidents and what triggers different responses? With your post-mortem, you should be writing up exactly what happened with a timeline. You should come out with action items as to how to fix the underlying root cause, and how to prevent it in the future. And if you are doing all this stuff correctly, you have a document that others can look at and learn from. Because a lot of things, in terms of running software systems, do end up being the same issue over and over. Sometimes in slightly different ways, but usually very similar. And it’s only the very weird ones that are only going to tickle one or two teams out of many teams. But if anyone has a solution of how you actually spread knowledge to a large community effectively, I’d love to hear them because we’re always looking to improve that.

Julie Gunderson: Well, excellent. And just to let you know, you can check out postmortems.pagerduty.com for how we recommend doing it here at PagerDuty, and sharing that and scheduling on shared calendars and whatnot. Weber, I’d love to hear the wrong answers. How have you seen postmortems go wrong?

Brian Weber: Blame. Blame, blame, blame. You screwed up. You did it wrong. What were you thinking? I’ve been on those phone calls. They’re awful. When you’re a junior engineer coming into a new environment, you can find them hilarious if you’re just observing, but it can be completely off-putting if you’re the target of that. I’ve been around this long enough that people are yelling at me because I screwed up. I’m like, “Whatever, dude. Seriously, let’s just get to a solution.” When blame happens, we can all at least do our best to defuse it, but it often takes a whole culture to prevent that blame. Now let’s talk two seconds about Five Whys. Because we use the Five Whys model here at Twitter. What that is, is you start with what happened, why did that happen? And then, why did that why happen? And on the way down. And usually it takes about five of those why questions before you get to a real root cause. You might find other sub root causes along the way, and it reveals things that you fix. Now, I’ve actually heard talks from, Julie, some of your colleagues over at PagerDuty on why Five Whys can be somewhat combative because people will say, “Well, why did this break?” Oh, I broke the code. And I feel like sometimes culture has to come in and say like, “No, the code just broke, and that’s okay. What happened? Yes, user error happens and that’s cool.” And to some degree I feel like, at Twitter, we do that pretty well about trying to lift the blame out of our Five Whys. And a lot of that just comes into, when we have our template postmortem, we have good words in there that say reasonable examples about, “Bug was pushed to production that was not caught by CI, that triggered this event, that triggered that event, and so on.” And when you get to that, hopefully if you can get down to a root cause without making it a specific human who screwed up, then you can ideally get to the best solution that way.

Brian Rutkin: Yeah. I think, to expand on that just slightly, it is rare that a single person is the cause for a problem. Unless you have someone who has intentionally gone to do something destructive, most likely, it means that the system in place was not able to catch a problem. And that’s what you really want to address as the root cause. Maybe you were allowed to push a bug to production, but why was that allowed? And then, you start looking at your systems. What things can we do to make this safer? What things can we do to make this easier or faster to remedy? All of those are the actual root causes that you want to try and get to. And that’s, if you do Five Whys properly, like Brian was saying, I think you get to those things, and you start looking at, “How do we make our systems and our culture better so that we prevent further outages in the future?”

Julie Gunderson: I like that. And that’s one of the things that we talk a lot about, too. Even going as far as to remove root cause from our vocabulary, and start talking about contributing factors. Because helping with the blame, and I like how you said root causes, right? There’s so much more to it. And, “Why was this able to happen?” Versus, “Who made it happen?” Now, we are running a little bit out of time. So, there are two things we ask every guest on this show, and I’m going to start with you, Brian Weber. What’s the one thing you wish you would have known sooner when it comes to running software in production?

Brian Weber: How to actually write code.

Julie Gunderson: Tell us more.

Brian Weber: Well, again, somehow I faked my way into this industry, well over a decade ago, and taught myself a lot along the way. Yet, I’m constantly stunned by so many of these younger, smarter engineers coming through. I struggle with imposter syndrome, just like everybody else does. But at the same time, I’m also just completely dazzled by the work that other people do and how they’re able to make a bunch of systems dance in ways that I’m not capable of doing. Now, I’m sure that that perspective is reversed as well. Because again, I’ve been around long enough to know, and I’ve had those conversations where people I work with then tell me, “Oh, no. Brian, I’m so glad you’re here. You write puppet code better than anybody. And you have that perspective on what we need that most of us don’t exactly have. And that you’ve taught me X, Y, Z.” And I’m like, “Oh, wow. I actually do know a thing or two.” The one thing that I do wish that I knew more was how to write better code in my junior years. But not so much because I feel like I’ve actually done something worthwhile with my career at this point in my life.

Julie Gunderson: Thank you for that. Now, is there anything about running software in production that you’re glad I didn’t ask you about?

Brian Weber: Maybe specific incidents that we’ve encountered. Because as much as I love talking about those, when we get to talking about specific incidences, there’s both the professional ramifications about, “Oh, you just revealed something private to our company and good grief. I wish that most companies were just better about sharing their incidents because it helps everybody learn.” And also, because I can get passionate and I can find myself, unwittingly, throwing a peer under the bus. Which I absolutely hate doing. I’m not perfect at communications, either. So, at least in the scope of a 30 minute podcast, I know that we can be more thoughtful about what we say.

Julie Gunderson: Thank you. And Brian Rutkin, what’s the one thing you wish you would’ve known sooner?

Brian Rutkin: Ah, that’s hard. There are so many things I wish I knew sooner. I think, really, the number one is that ownership is a choice, a personal choice. You can be assigned ownership of course, but that doesn’t actually mean that you take ownership. And a lot of people will think that I mean about running a service. But actually, I mean about your work in general, and that you can own lots of different things. You can own how you interact with people on your team and the rest of the people in your company, the people around you in the world. These are things that we choose to do and how we choose to do them. And really knowing that it was totally up to me on how I was going to face those things and be responsible. I think that it’s something that I shifted to doing, fairly early in my career. I started doing help desk support kind of stuff for web hosting. And so, taking ownership of problems and finding that solution was something that I gravitated to. But it wasn’t something that I was conscious of until, I don’t know, probably starting at Twitter is when I really realized how I could do that. And if someone had maybe pointed out the concept to me earlier, I think I would have done more in different places than I did.

Julie Gunderson: And what’s the one thing you’re glad I didn’t ask you about today?

Brian Rutkin: How much I like PagerDuty.

Julie Gunderson: And of course, you’re going to tell us how much you love PagerDuty. Right?

Brian Rutkin: I do actually like PagerDuty a lot. I don’t like getting pages, so there’s a little [inaudible 00:28:43] there that’s all.

Julie Gunderson: I think that’s fair. I think most people can agree they love PagerDuty, just not the phone ringing at 3:00 AM, no matter how fun those ringtones are. Brians, I want to thank both of you so much for being here with us today. And once again, it’s Brian Weber @mistermocha on Twitter, and Brian Rutkin @arocknerd on Twitter. And with that, this is Julie Gunderson wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two. That’s @pageit2thelimit. Let us know what you think of this show. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Show Notes

What is SRE:

Brian Weber kicks the conversation off with an overview of Site Reliability Engineering (SRE).

Brian Weber: “I look for things that are outliers in efficiency that are health problems, not so much product health, but developmental health, and try to impose those ‘standards’ on my team. I find myself talking to other SREs, both in the company and out of the company, to try and get an idea of what those kinds of things look like.”

Brian Rutkin gives us the wrong answer, discussing how SRE is more than just a cool title and how SRE is not DevOps.

Brian Rutkin: “Taking your engineers that are DevOps and suddenly rebranding them as SREs is not necessarily the right thing, either. SRE kind of falls toward the middle of those in terms of how you would use them.”

Delving into DevOps and SRE

The Brians talk to us about how DevOps and SRE work together.

Brian Rutkin: “SRE is an implementation of DevOps…. SRE is the understanding that operational work is required, and the goal should be to remove absolutely as much of it as you possibly can by a human.”

Brian Weber counters with the misconception of how much software development someone with an SRE title should be doing. He continues to talk to us about applied implementation and researching components of SRE.

Metrics: SLOs and SLAs

We talk about setting and publishing Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and best practices around setting and accomplishing these.

Brian Rutkin: “This is really going to vary for every organization and every service, at least by a little bit. I think that most people would agree that you want to focus on a very few number of SLOs to drive and accomplish your SLAs.”

Rutkin continues to talk to us about setting the metrics and what you want to know from your service: success rate, latency, and accuracy.

Brian Weber discusses the misconceptions about what a customer is, what a customer should be, and what people should be paying attention to with SLAs.

Weber: “Your SLA is an amount of uptime and amount of reliability… it just has everything to do with what is your end state.”

Tuning Alerts

The conversation turns to creating and tuning alerts and what the role of SRE is within that area. Brian Weber talks about how to make noise levels appropriate for new products and how tuning alerts is an ongoing process.

Weber: “Tuning alerts ends up being an ongoing process, that’s why it’s tuning alerts and not setting alerts.”

Brian Rutkin walks us through antipatterns with alert tuning and how going to extremes is a common mistake. He also talks about how alerts should be used for two purposes: immediate problems and watching trends over time.

The Brians discuss how to share learnings across the organization.

Rutkin goes over how you can determine the correct ways to communicate within your organization and define thresholds for when a postmortem is required.

Brian Weber illuminates where many postmortems go wrong: “Blame, blame, blame.” He continues to discuss blame as a main reason postmortems go wrong and talks about using the right words in the postmortem template. Rutkin adds that it’s rarely a single person that is the cause of a problem.

Both Brians discuss the “Five Whys” and how you can prevent future outages through systems and culture changes.

Guests

Brian Rutkin

Brian is an SRE at Twitter where he works on Core Services and all the things they touch (so pretty much everything). Often that means just trying to ensure all the different services and people get along together.

Brian Weber

After coming from a non-tech background, I’ve been an SRE at Twitter for five years and had related titles for well over a decade and a half. When away from the computer, I enjoy everything outdoors and experimenting in the kitchen.

Hosts

Julie Gunderson

Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.