Yuri discusses the history of alerting and why alerts should fire only on things that impact the customer experience.
Yuri: “It’s one of the soapboxes that I find myself on when talking to customers. It so often happens that customers will look for help with alerting: ‘How will I know if I am having an issue with my infrastructure? How will I know if I have high memory consumption?’ You should never alert on things in your infrastructure; you should only alert on things that impact your customer experience.”
Yuri and Julie discuss common mistakes customers make dating back to the beginning of alerting and how products like PagerDuty have changed the way alerting should be done.
Yuri and Julie discuss where PagerDuty came from and where it is today.
Yuri: “The fundamental problem that you are trying to solve, which is like hey I get 1,000 alerts an hour and I don’t know which ones are important. That’s not the problem we should be buying PagerDuty to solve, that’s the problem we should be addressing at the root.”
Julie and Yuri continue to discuss the issues with email filtering and essentially “training people to ignore alerts.”
Yuri explains that it comes down to the service owner, who is ultimately accountable for the reliability of that system, and to what user happiness means in this context. He goes on to discuss SLIs and SLOs.
Yuri: “We use SLIs as a proxy for user happiness.”
Yuri and Julie discuss setting up alerting with SLIs and SLOs in mind, along with the need for alerts to be humanly actionable, with a little bit of error budgets sprinkled in.
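For readers who want to see the SLI, SLO, and error budget ideas in concrete terms, here is a minimal sketch. It is not from the episode: the function names, the request-based definition of "good events", and the 99.9% target used in the usage note are all illustrative assumptions.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of events that were good.

    Assumption: an 'event' is something user-facing, such as a request,
    and 'good' means it succeeded quickly enough for the user.
    """
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing went wrong
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent in this window.

    1.0 means untouched; 0.0 or below means the budget is exhausted.
    slo_target is a hypothetical objective such as 0.999.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad
```

With a hypothetical 99.9% target, 10,000 requests in the window, and 5 failures, roughly half of the error budget would remain. An alert tied to how fast that budget is being spent reflects user impact rather than infrastructure state.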
Yuri: “Things that are not directly contributing to or impacting user happiness, those should be created as tickets in a ticketing system… there is no need to wake someone in the middle of the night.”
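The page-versus-ticket split Yuri describes can be sketched as a simple routing rule. This is an illustration under stated assumptions, not something presented in the episode: the `Signal` type, its fields, and the `route` function are hypothetical names, and sending non-actionable signals to tickets is an interpretation of the "humanly actionable" requirement.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """A hypothetical monitoring signal under consideration for alerting."""
    name: str
    impacts_users: bool  # is user happiness affected right now?
    actionable: bool     # can a human do something about it right now?


def route(signal: Signal) -> str:
    """Return 'page' only for user-impacting, humanly actionable signals.

    Everything else becomes a ticket to be handled during working hours.
    """
    if signal.impacts_users and signal.actionable:
        return "page"
    return "ticket"
```

Under this rule, a signal such as high memory consumption that is not yet affecting users would be routed to a ticket, while a checkout error rate that is burning through the error budget would page someone.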
Continued discussion around when alerting should wake someone up in the middle of the night.
Julie: “Making sure every alert that wakes a human up is humanly actionable sounds great but isn’t always easy, and it comes down to fine tuning. Do you have recommendations?”
Yuri: “People often feel that if they don’t have an alert for it, it’s not actually happening.”
Julie: “Let’s go deeper on how do we really dig deep into what the customer experience means when you are looking at service level indicators and service level objectives.”
Yuri: “You have to have a good understanding of what are people actually trying to do, and then some way of quantifying.”
The discussion continues on what metrics we use to quantify customer success and performance.
Yuri: “The closer we are able to collect this information to the customer, the more accurate it is going to be.”
The topic shifts to how technical debt is expressed as a gap in knowledge and how people treat their systems as a black box.
Yuri: “You’ll hear the term ‘black box monitoring’ because they don’t actually know how it works.”
Continued discussion on how technical debt manifests itself in monitoring.
Yuri discusses what service ownership means to him and how it is really “engineer empowerment”, and what that means to reliability.
Yuri Grinshteyn is a Customer Engineering Specialist at Google Cloud, where he works with customers to help them design reliable architectures and advocates for SRE practices and principles.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team at PagerDuty. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.