In this episode Julie Gunderson talks with Erik Ketcham, Senior Manager, Software Engineering for a financial services company.
Erik talks about his high-level monitoring strategy for folks who are new to the practice. He covers things you may want to think about, such as how long it takes to load a webpage, breaking down website requests into different categories, measurements, and SLAs.
Erik: “Another aspect might also be monitoring things like database contention. Maybe you’ve written some really amazing application code and it’s just hammering your database. So you have contention on your database or you need to have read replicas or all kinds of other things to help cache your responses so you can have fast response times, but a lot of the things that are monitoring in general are going to be around web performance and kind of the full stack between someone that can request and response back.”
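As a rough illustration of breaking requests into categories and measuring each against an SLA, here is a minimal Python sketch. The category names, the URL prefixes, and the latency targets in `SLA_MS` are assumptions made up for this example and do not come from the episode. It uses a simple nearest-rank 95th percentile.

```python
import math
from collections import defaultdict

# Hypothetical per-category latency targets in milliseconds (illustrative only).
SLA_MS = {"page": 2000, "api": 500, "static": 200}


def categorize(path):
    """Assign a request path to a category using assumed URL prefixes."""
    if path.startswith("/api/"):
        return "api"
    if path.startswith("/static/"):
        return "static"
    return "page"


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def sla_report(requests):
    """requests: iterable of (path, duration_ms) pairs.

    Returns, per category, the p95 latency and whether it meets the target.
    """
    by_category = defaultdict(list)
    for path, duration_ms in requests:
        by_category[categorize(path)].append(duration_ms)

    report = {}
    for category, durations in by_category.items():
        latency = p95(durations)
        report[category] = {
            "p95_ms": latency,
            "within_sla": latency <= SLA_MS[category],
        }
    return report
```

Calling `sla_report` on a batch of `(path, duration_ms)` pairs gives one entry per category, which makes it easier to see whether a slow experience comes from page loads, API calls, or static assets.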
Erik debunks the myth that when big companies started they had everything at their fingertips, when in reality startups start with very minimal footprints and are doing as much as they can from a single server.
Erik: “For instance, you know, we’d have a single server, we’d add a bunch of users. At some point we’d start getting to a point where your databases are overloaded, there’s too much contention, your app servers are falling over, you’re up all night and weekend kind of doing all that provisioning and scalability. But initially you think oh, like the startups just have instant scalability and they’re never going to fail, but the reality is that angel investors only give you a couple hundred thousand dollars.”
Erik chats about how to make things easier in startups and environments with limited resources, and choosing the right resources.
Erik: “Choosing the right technology and the right software and programming languages and scaling technology can really go a long way to reducing how much pain and anguish you go through while you are scaling your company.”
Erik goes on to share ways to invest in your company learnings, by investing in the right people and reinvesting profits back into the people. He talks to us about how everyone in a startup is operations, how teams work together, and how working together with the development teams leads to ultimate success.
Erik shares how he thinks about observability and what that looks like coast-to-coast. With a background in gaming, he discusses the need to have monitoring in place and how standard deviation and observability in your services are hard things to solve.
Erik: “It’s pretty important to have these tools, and I think that where we’re at currently, and where we’re headed in the future in terms of observability monitoring with various tools that are out there right now that tie into PagerDuty are allowing us to have much more sleep at night than they ever were before.”
Erik talks about how standard deviation is the main concept behind monitoring, and how to differentiate normal deviations, caused by expected work that occurs in the middle of the night, from genuinely abnormal ones.
Erik: “So you’ll end up saying, hey, there’s no traffic between 12 and one, but at one we run all these cron jobs and we do a lot of DB queries or something like that. Or, you know, try to purge records or do some kind of background task asynchronously, and so that’s a normal deviation, don’t worry about that. But then once they were doing those cron jobs and then somebody, a hacker from New Zealand comes in, spams our APIs with a DDoS attack. We still want to be alerted for that even though we expect it to be high, it’s way abnormally high from what we expected. So that’s the main concept behind any monitoring and standard deviation is figuring out, it’s okay to have a huge spike in the middle of the night because we expect it because it’s self-induced.”
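One way to picture the idea Erik describes is a separate baseline for each hour of the day, with an alert only when traffic strays too many standard deviations from what is normal for that hour. The Python sketch below is an assumption-laden illustration, not a description of any particular monitoring tool: the function names, the data shape, and the threshold of 3 standard deviations are all hypothetical choices.

```python
from statistics import mean, stdev


def build_baseline(history):
    """history: dict mapping hour of day -> list of request counts seen at
    that hour on previous days (at least two samples per hour).

    Returns a dict mapping hour -> (mean, sample standard deviation).
    """
    return {hour: (mean(counts), stdev(counts)) for hour, counts in history.items()}


def is_anomalous(baseline, hour, observed, threshold=3.0):
    """True if `observed` is more than `threshold` standard deviations away
    from the usual value for that hour. The default threshold is an assumption."""
    average, deviation = baseline[hour]
    if deviation == 0:
        # No historical variation at all: any different value stands out.
        return observed != average
    return abs(observed - average) / deviation > threshold


# Made-up numbers: midnight is quiet, 1 a.m. is busy because of cron jobs.
history = {
    0: [10, 12, 11, 9, 10],
    1: [5000, 5200, 4900, 5100, 5050],
}
baseline = build_baseline(history)
```

With this made-up data, `is_anomalous(baseline, 1, 5100)` is false because a spike at 1 a.m. is expected, while `is_anomalous(baseline, 1, 50000)` is true because it is far above even the expected spike. The same 5,000 requests arriving at midnight would also be flagged, since nothing is scheduled to run then.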
Erik discusses how being an engineering manager changes the way he thinks, such as using code freezes during certain windows like Black Friday, and building in robustness when you are planning your software deploys.
Erik: “You typically want to have a plan to either roll back your code changes or fix it forward, which is a really bad thing to do unless you have to do something like that. So imagine you deploy some really bad code and it was because you changed an if-then statement to have a greater-than versus a less-than symbol. So what do you do, do you roll back your deploy or do you change the greater-than or less-than symbol and deploy that as a fix forward? You have to make those calls all the time.”
Learning from mistakes is a major driver of a culture of accountability. Erik talks about empowering folks to learn from their mistakes and using postmortems to learn from failures.
Erik: “We have to mitigate risk as best as possible. And there’s only one way to really learn, and it’s by messing up.”
Additionally, Erik chats about how good software design and forethought go extremely far and how thinking through what you are supposed to do and communicating can go a long way.
Erik: “You’ve got speed, quality, or stability, you can have two but not three.”
He goes on to share how to think through those trade-offs and make business decisions with the least amount of risk.
I’m a software engineering manager from a non-traditional background who speaks enough human to be productive with the rest of the tech industry. I’ve traversed my career into my level of inadequacy, and look forward to moving even further down/up that path until I know absolutely nothing about what I’m doing.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.