Brian Weber kicks the conversation off with an overview of Site Reliability Engineering (SRE).
Brian Weber: “I look for things that are outliers and efficiency that are health problems, not so much product health, but developmental health and try to impose those “standards” on my team. I find myself talking to other SREs, both in the company and out of the company to try and get an idea of what those kinds of things look like.”
Brian Rutkin gives us the wrong answer, discussing how SRE is more than just a cool title and how SRE is not DevOps.
Brian Rutkin: “Taking your engineers that are DevOps and suddenly rebranding them as SRE is not necessarily the right thing either. SRE kind of falls in toward the middle of those terms of how you would use them.”
The Brians talk to us about how DevOps and SRE work together.
Brian Rutkin: “SRE is an implementation of DevOps…. SRE is the understanding that operational work is required, and the goal should be to remove absolutely as much of it as you possibly can by a human.”
Brian Weber counters with a common misconception about how much software development someone with an SRE title should be doing. He goes on to talk about applied implementation and about researching the components of SRE.
We talk about setting and publishing Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and best practices around setting and accomplishing these.
Brian Rutkin: “This is really going to vary for every organization and every service at least by a little bit. I think that most people would agree that you want to focus on a very few number of SLOs to drive and accomplish your SLAs.”
Rutkin goes on to talk about setting metrics and deciding what you want to know from your service: success rate, latency, and accuracy.
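To make the three metrics concrete, here is a minimal sketch of how success rate, latency, and accuracy might be measured and compared against a small set of SLO targets. It is not taken from the episode: the `Request` record, the function names, and the target values are all hypothetical, and real targets will vary by organization and service.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    ok: bool           # the request completed without an error
    latency_ms: float  # time taken to serve the request
    correct: bool      # the response content was accurate


def success_rate(reqs: List[Request]) -> float:
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in reqs) / len(reqs)


def accuracy(reqs: List[Request]) -> float:
    """Fraction of requests that returned a correct response."""
    return sum(r.correct for r in reqs) / len(reqs)


def latency_percentile(reqs: List[Request], pct: int) -> float:
    """Nearest-rank percentile of latency, e.g. pct=95 for p95."""
    ordered = sorted(r.latency_ms for r in reqs)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical targets, chosen only for illustration.
SLO_TARGETS = {"success_rate": 0.99, "p95_latency_ms": 300.0, "accuracy": 0.999}


def check_slos(reqs: List[Request]) -> Dict[str, bool]:
    """Report whether each SLO is currently being met."""
    return {
        "success_rate": success_rate(reqs) >= SLO_TARGETS["success_rate"],
        "p95_latency_ms": latency_percentile(reqs, 95) <= SLO_TARGETS["p95_latency_ms"],
        "accuracy": accuracy(reqs) >= SLO_TARGETS["accuracy"],
    }
```

Keeping the list of targets this short reflects Rutkin's point about focusing on very few SLOs; anything else a team measures can stay a dashboard metric rather than an objective.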
Brian Weber discusses the misconceptions about what a customer is, what a customer should be, and what people should be paying attention to with SLAs.
Weber: “Your SLA is the amount of uptime and availability… it has everything to do with what is your end state.”
The conversation turns to creating and tuning alerts and what the role of SRE is within that area. Brian Weber talks about how to make noise levels appropriate for new products and how tuning alerts is an ongoing process.
Weber: “Tuning alerts ends up being an ongoing process, that’s why it’s tuning alerts and not setting alerts.”
Brian Rutkin walks us through antipatterns in alert tuning and how going to extremes is a common mistake. He also talks about how alerts should be used for two purposes: flagging immediate problems and watching trends over time.
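The two purposes can be sketched as a single rule that either pages someone now or files a lower-urgency ticket when a metric is drifting. This is only an illustration under assumed values: the function name, the thresholds, and the window size are hypothetical and would need the ongoing tuning that Weber describes.

```python
from statistics import mean
from typing import List


def classify_alert(
    error_rates: List[float],
    page_threshold: float = 0.05,  # hypothetical: page at a 5% error rate
    trend_ratio: float = 1.5,      # hypothetical: flag a 50% rise between windows
    window: int = 3,               # hypothetical: samples per comparison window
) -> str:
    """Return "page", "ticket", or "none" for a series of error-rate samples."""
    # Purpose 1: an immediate problem that needs a human right now.
    if error_rates[-1] >= page_threshold:
        return "page"

    # Purpose 2: a trend worth watching, compared across two adjacent windows.
    if len(error_rates) >= 2 * window:
        recent = mean(error_rates[-window:])
        earlier = mean(error_rates[-2 * window:-window])
        if earlier > 0 and recent >= trend_ratio * earlier:
            return "ticket"

    return "none"
```

Setting `page_threshold` very low or very high reproduces the extremes mentioned above: either everything pages, or nothing does.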
The Brians discuss how to share learnings across the organization.
Rutkin goes over how to determine the right ways to communicate within your organization and how to define thresholds for when a postmortem is required.
Brian Weber illuminates where many postmortems go wrong: “Blame, blame, blame.” He goes on to talk about using the right words in postmortems and how a problem is rarely caused by a single person.
Both Brians discuss the “5 Whys” and how you can prevent future outages through systems and culture changes.
Brian is an SRE at Twitter where he works on Core Services and all the things they touch (so pretty much everything). Often that means just trying to ensure all the different services and people get along together.
After coming from a non-tech background, I’ve been an SRE at Twitter for five years and had related titles for well over a decade and a half. When away from the computer, I enjoy everything outdoors and experimenting in the kitchen.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.