Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. Today, we’re going to be talking about monitoring with Erik Ketcham. Erik is from San Francisco and has a background working with various startups and gaming companies, and is now in financial services. Erik, you want to take a moment?
Erik Ketcham: Hey, thanks for having me, Julie. Yeah. So as you mentioned, I’ve been in and around startups in San Francisco for the last 15 years or so. Currently I’m an engineering manager of a software development team, as I have been for several years now. We deal with monitoring quite a bit on the application side of things, so hopefully I can have some input here and participate in a lively discussion around application monitoring and instrumentation.
Julie Gunderson: Well, thank you, Erik. To get us started, what’s your high-level strategy for monitoring, for anybody that’s new to the practice?
Erik Ketcham: Sure. So from a software development perspective, the main concerns we typically have are around latency for requests. So you might think about how long it takes to load a webpage, and everything that goes into that is basically the background of what we’re instrumenting and monitoring. So a lot of times you break a request to a website down into a few different categories. One might be the latency of 50% of your traffic, or 98% of your traffic, or 99%. A lot of times we refer to those things as P50, P98, P99. So it’s basically saying that 50% of your traffic gets that response time or better, or 98% of your traffic gets that response time, or 99%. And a lot of times when you’re dealing with vendors and people who are consuming your services, you have what’s considered an SLA. I think most people understand that concept: basically you’re saying your service level agreement for these response times is such that your website’s very responsive, or an API call to your web service responds within this amount of time. So a lot of my instrumentation of my software and my considerations for monitoring revolve around those P98, P99, P50 measurements. So if I think about my correspondence with PagerDuty, a lot of it is, hey, just so you know, your web server is down, or your web server is really slow to respond to these requests, things like that. Another aspect of that might also be monitoring things like database contention. Maybe you’ve written some really amazing application code and it’s just hammering your database, so you have contention on your database, or you need to have read replicas or all kinds of other things to help cache your responses so you can have faster response times. But a lot of the things that I’m monitoring in general are going to be around web performance and kind of the full stack between someone making a request and getting a response back.
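The P50/P98/P99 breakdown Erik describes can be sketched in a few lines of Python. This is a toy example (made-up latency samples, standard library only), not any particular vendor’s implementation, showing how those percentile cut points fall out of a list of request timings:

```python
# Hypothetical sketch: computing P50/P98/P99 from request durations
# in milliseconds. The sample data below is made up.
import statistics

def latency_percentiles(samples_ms):
    """Return the P50, P98, and P99 latencies for a list of samples."""
    # quantiles(n=100) returns 99 cut points; the points at indices
    # 49, 97, and 98 correspond to P50, P98, and P99 respectively.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p98": cuts[97], "p99": cuts[98]}

# 1,000 fake request timings: mostly fast, with a slow tail
samples = [20] * 950 + [200] * 40 + [900] * 10
print(latency_percentiles(samples))
```

Reading the result as “98% of requests complete in this time or better” matches the SLA framing above: the slow tail barely moves P50 but dominates P99.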
Julie Gunderson: Thank you. If you were to tell us a myth or a common misconception, what would it be that you’d like to debunk?
Erik Ketcham: I think early on in my career, I assumed that all software companies like Google or Facebook or Twitter have just massive colocated servers someplace in a Nevada desert location or something, and that they have an insanely clustered environment that they’re working off of that’s super redundant and scalable. And I think that the reality typically is that they all started with a few servers. I’ve read some things about the early days of Google and their search engine running off of several very over-provisioned boxes. But in general, startups are going to start with very minimal footprints, very low AWS or Google App Engine bills, by using as many resources as they can from a single server, then just waiting for the money to roll in to build and provision [inaudible 00:04:15] servers. So when I got into video games, for instance, we’d have a single server, we’d add a bunch of users, and at some point we’d start getting to a point where your database is overloaded, there’s too much contention, your app servers are falling over, and you’re up all night and weekend doing all that provisioning and scalability work. But initially you think, “Oh, these startups just have instant scalability, and they’re never going to fail.” But the reality is that angel investors only give you a couple hundred thousand dollars. You can only spend so much on your AWS bill every month before you hit your burn rate. Everybody’s just kind of trying to make it to profitability. So a lot of times that means your infrastructure isn’t such that you can just scale infinitely. So I don’t think that everything is perfect. Everything is pretty wild west until you actually start making revenue from your products.
Julie Gunderson: So given that, are there ways to make it easier for folks in these environments that you want to share with us?
Erik Ketcham: Yeah, there’s a couple of really well-documented cases of companies making massive changes in their infrastructure by switching the programming languages and infrastructure underneath them. A good, if very overused, example is Bleacher Report. They do a lot of sports information services and APIs. You can go there to get up-to-date timing on, “Oh, the Giants are beating the Dodgers.” My favorite topic. And in real time you can see, “Oh, somebody got a home run.” So their APIs have to be insanely available to millions and millions of people. Initially they were working, I think, in Ruby on Rails, and they ended up having to have hundreds and hundreds of servers to keep up with that demand. So at some point they decided that their AWS cost was too much, they weren’t making enough money, and they weren’t scaling quickly enough. They were spending a lot of nights and weekends doing all these provisioning tasks and trying to make Ruby on Rails work. They decided, “Hey, if we use a strongly typed language, and maybe a virtual machine that was more efficient with memory and CPU usage, we could potentially slash how many servers we have and how much scalability we can achieve.” So they switched to a technology called Elixir, which is based on Erlang. It runs on the BEAM, which is the Erlang virtual machine. There’s a famous blog post from Ben Marx, a really amazing software developer, about how they went from dozens of Ruby servers down to, I think, two or three Elixir or Phoenix servers. So there’s a lot you can do if you choose the right technology to reduce your AWS footprint and your bill, and your scalability becomes a much longer runway for you. So choosing the right technology and the right software and programming languages and scaling technology can really go a long way to reducing how much pain and anguish you go through while you’re scaling your company.
Julie Gunderson: So when you have all that pain and anguish, how do you turn that into a learning model for your organization?
Erik Ketcham: Typically, the way startups tend to work with their software development is: what can we do to recruit the top talent quickly, in terms of selecting a technology or a language that’s attractive to them? A lot of people choose Ruby or Python because everybody in the software development community knows those languages. If you’re graduating with a CS degree from a school, you probably know Python or Ruby. Those languages typically scale up to a point, and then you have to make a decision about going to Kotlin or Java or Scala, or some other more type sensitive language, or strongly typed language rather. And that enables your scalability. So as you’re going through that pain and anguish, you’re constantly saying, “I know there’s a better way, but it’s harder to hire these people.” Or, “Maybe it’s time to start paying more for people who are more expert.” So you start off as a generalist software developer that knows a little bit about everything from top to bottom, full stack, and then you start realizing, “Hey, I have to staff all these experts who really know how to scale databases, or really know how to scale web services, or really understand caching at a much more fundamental level.” So the lesson you learn through that is that you have to invest in really smart people to do things. I don’t necessarily think I’m one of those people, but I’ve been working with a lot of people throughout my career that are extremely smart in very specialized areas. I think that’s one thing that you have to realize at some point, as a developer, as an entrepreneur: investing in the right people will help you ease those pains that you had early on when you had very little money, and now that you’re making profit, you have to reinvest that into your talent.
Julie Gunderson: That makes a lot of sense. Let me ask you, how do you bring operations into the fold?
Erik Ketcham: Operations is one of those things where, again, kind of like what I said before about being a generalist at a small startup, I think everyone is operations, and you’re a software developer, and you’re everything in between. As an engineering manager at a video game company in my past, I was up at night provisioning databases. And I think that was around the time the word DevOps came onto the scene and we started seeing a lot of developers transitioning into the DevOps role. A lot of people thought it was a dirty word for a long time, and you were kind of expected to do software development and infrastructure and operations and things like that. So I think at some point you specialize: you make enough money to invest in operations, and you actually hire some really smart people who maybe have spent their time at AWS, have vested their stock options, and are now ready to go work for a smaller company that’s growing faster or has more equity available to them. So that’s kind of the relationship you have: when you’ve done that full stack development, and you’ve been operations yourself, and you hire that operations person and they’re up all night helping you provision servers at 2:00 AM, you really earn this deep, deep respect for the ops teams and what they provide for you. In the early days it was just admins; we were hosting all of our own hardware in our own racks, and they would go to the colo and put more RAM into the machine for you in the middle of the night. So what we have now is so much easier obviously, but those people are crucial to the successful operation of any business, and obviously they’re well worth their weight and their salaries.
Julie Gunderson: Well, I think that’s great. And it’s a good point when you talk about kind of being that generalist, where you have a broad understanding of everything and maybe some deep understanding in certain areas. One of the things that we often hear along with monitoring is observability. Can you talk to us a little bit about how you tie those in?
Erik Ketcham: Yeah. Yeah. Observability, I think, outside of what I covered before about the latency of requests and responses, the thing that I often have to think about is, as people get up in the morning on the East Coast, your traffic starts spiking, right? Overnight it’s very flat and very low, especially if you’re just doing business in the continental United States. So you think about things like standard deviation and setting thresholds for min and max. So imagine you’re in a video game: you load up your game board and you see your castle hit level 17 or something like that. But everybody comes in at 8:00 AM, because everybody wakes up and wants to click their button to make sure their castle’s building or something. So the standard deviation monitoring that people like Datadog provide, for instance, is very robust, and I think new versions of [inaudible 00:11:20] are also adding a lot of observability around standard deviation as well. But basically you want to be able to have monitoring in place that moves with the traffic patterns that your website normally has. So it’s going to ebb and flow throughout the day, obviously. People are going to be at work and they’re going to be sneaking onto their phone to use your service or something like that. They’re going to come home from work, they’re going to have some dinner, and kind of come back and re-engage. Maybe they’ve been thinking about that Amazon purchase all day long; they come home, they have dinner, and then they go buy that thing to help them shelter in place or something like that. So standard deviation and observability of your services is really a hard thing to solve. A lot of these monitoring companies, a lot of these graphing technologies, are starting to factor it in more heavily and make it much more user friendly for people who are developers, who aren’t math quants with math degrees, to figure out the standard deviation curves and how to monitor those things.
So it is a large portion of what I have to be concerned with. If my alerts fired every time my traffic dropped to a certain threshold, I’d get woken up every night. So it’s pretty important to have these tools. And I think that where we’re at currently, and where we’re headed in the future in terms of observability monitoring, with the various tools that are out there right now to tie into PagerDuty, is allowing us to get much more sleep at night than we ever did before.
Julie Gunderson: Yeah. And can you dive a little bit deeper into that? How do you tune these so that you’re not getting woken up? How do you get rid of the non-actionable alerts?
Erik Ketcham: The standard deviation concept is essentially that you allow for this much deviation during this much time period. So a lot of technology for graphing things is focused on this technology called time series data. So if every millisecond I record what my CPU is, what my memory is, how much storage space I have on a disk, or how much traffic this one API is getting, I should be able to see a blueprint over time, after several days, that at 2:30 in the morning this API gets hit five times in the first second, and then maybe at 2:35 it’s getting hit 15 times on average. Maybe you’ve got a Cron job running in the background that’s hitting this API, because you do a lot of operations at night while people are sleeping. So you start mapping out these traffic patterns, and then you establish monitoring in the future based on the standard deviations from those traffic patterns. So that’s kind of the core concept: you’ll see a bunch of spikes. Usually, most companies run some background jobs at night because traffic is so low and you’re paying for all those resources from AWS, so why wouldn’t you use them? So you’ll end up saying, there’s no traffic between 12 and one, but at one we run all these Cron jobs and we do a lot of DB queries or something like that, or try to purge records, or do some kind of background tasks asynchronously. So that’s a normal deviation; don’t worry about that. But then say, outside of those Cron jobs, a hacker from New Zealand comes in and spams our APIs with a DDoS attack. We still want to be alerted for that: even though we expect traffic to be high, it’s way abnormally high from what we expected. So that’s the main concept behind monitoring on standard deviation: figuring out that it’s okay to have a huge spike in the middle of the night, because we expect it, because it’s self-induced.
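The per-time-of-day baseline Erik is describing can be sketched simply. This toy example (fabricated traffic numbers, standard library only, not any vendor’s actual algorithm) builds an hourly mean and standard deviation from several days of request counts, then flags only counts that stray well outside the baseline for that hour, so the expected 1 AM cron spike stays quiet while a 3 AM burst pages someone:

```python
# Hypothetical sketch of standard-deviation monitoring on time series
# request counts. All traffic numbers below are made up.
import statistics
from collections import defaultdict

def build_baseline(history):
    """history: (hour_of_day, request_count) pairs over several days.
    Returns {hour: (mean, stdev)} -- the expected pattern per hour."""
    buckets = defaultdict(list)
    for hour, count in history:
        buckets[hour].append(count)
    return {h: (statistics.mean(c), statistics.stdev(c))
            for h, c in buckets.items()}

def is_anomalous(baseline, hour, count, k=3.0):
    """True if count deviates more than k sigmas from this hour's mean."""
    mean, stdev = baseline[hour]
    return abs(count - mean) > k * max(stdev, 1.0)  # floor damps near-zero stdevs

# Seven days of fake traffic: quiet overnight, a 1 AM cron spike, busy daytime
history = []
for day in range(7):
    for hour in range(24):
        base = 5 if hour < 7 else 100
        if hour == 1:
            base = 80  # nightly cron jobs hitting the API -- an expected spike
        history.append((hour, base + day))  # small day-to-day jitter

baseline = build_baseline(history)
print(is_anomalous(baseline, 1, 85))   # usual cron spike: not anomalous
print(is_anomalous(baseline, 3, 500))  # 3 AM DDoS-sized burst: anomalous
```

The key property is the one from the conversation: the threshold moves with the traffic pattern, so a self-induced spike at 1 AM and a flat line at noon are both “normal,” while the same count at the wrong hour still alerts.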
Julie Gunderson: Well, thanks. And I think you’ve answered the mysterious question as to why do I always get paged at 2:30 in the morning?
Erik Ketcham: Yes, exactly. You guys never call me when it’s a good time.
Julie Gunderson: Right. Or, when it’s that meeting that you just really want to get out of.
Erik Ketcham: Right, right, right. It’s only 2:30 in the morning.
Julie Gunderson: Erik, with your background with gaming and some of the stresses in those industries, how do you kind of change how you go about things when it’s a holiday weekend and a bunch of people are going to be on? Or maybe it’s in the middle of a pandemic where everybody’s at home, is there a different way of thinking?
Erik Ketcham: Yeah, that’s a great question. I think it’s something that we consider constantly. As an engineering manager, I’m responsible for the sanity and health of my entire team, and if all we did was ship code at 5:00 PM on Friday, I would be a miserable person and so would my team. So a lot of it is just best practices and saying, “Hey, there’s a really scary migration we’re going to do to our database, and we shouldn’t do it over the weekend.” Or sometimes you’re kind of bound to say, “Hey, we can’t do this huge migration while there are a lot of people on our website. We have to do it at 10:00 PM on Tuesday night or something like that.” So a lot of it is just self-control, and basically suffering through enough of those lost weekends and nights that you say, we’re not going to deploy code during these periods. A good example at my current company is that we ramp up for Black Friday, right? There are a lot of purchases made during Black Friday and Cyber Monday. We have code freezes where you can’t commit code past a certain time and you can’t deploy code in this window. We have to guarantee to our consumers that they’re going to experience a high degree of success with their checkouts or things like that. The same thing exists in video games. You’re capturing someone’s virtual currency, and if they can’t have access to spend their rubies or their coins or whatever the [inaudible 00:16:18] are, they’re going to get really mad at you. So when you do risky things, you definitely have to have a really good migration strategy, so that if something goes bad, you can roll it back immediately. And I guess, additionally, you kind of want to make sure that the timing is right, that you have enough staff, and that it’s not the middle of the night, to help remediate something if it breaks, right? So yeah, there’s a lot of robustness you have to build into just how you plan your software deploys.
Another thing that happens pretty frequently is that at small startups you’re pushing code all day long. I’ve worked at startups where I’ve pushed code 30 times in a day, no big deal. Unless you mess something up, and then suddenly everyone’s on your back, and you’re getting paged, and your [inaudible 00:16:59] are going crazy, and your CEO is at your desk asking you, “What’s going on?” So in scenarios like that, some rigor around what you do when things go bad, and planning for it, goes a long way. So I’ve definitely been on the hot seat where I’ve pushed bad code. I think we all do it as software engineers. You learn from it, and if you don’t, you’re not a good software engineer. So you typically want to have a plan to either roll back your code changes or fix it forward, which is a really bad thing to do unless you have to do something like that. So imagine you deploy some really bad code, and it was because you changed an if-then statement to have a greater-than versus a less-than symbol, right? So what do you do? Do you roll back your deploy, or do you change the greater-than or less-than symbol and deploy that as a fix forward? Right? So you have to make those calls all the time. If your company has a policy that says, “We are always going to roll back; we’re never going to fix forward,” that is also really helpful. Sometimes when you do a schema migration with a database, you can’t roll back, so you have to try to fix forward. So part of making a really dangerous migration is understanding: how can I make my software robust enough that I can fix it forward, or revert to software that won’t utilize that new column or won’t hit something that’s not indexed anymore in your database? So a lot of planning goes into it, a lot of thinking. And part of the process of not catching yourself in a bad situation, a lot of times, is just documenting what you’re about to do, sharing it with all your coworkers, and saying, am I crazy? Is this stupid?
Should I not do this? Is there a more sane way to do this? And then getting buy-in, so that at least when you mess up, everyone has signed off on how you messed up, basically.
Julie Gunderson: So as an engineering manager, how do you kind of drive this culture of accountability, but not necessarily fear of failure forward to your team?
Erik Ketcham: We all need to learn from the mistakes that we make. And I think it’s natural; everyone makes mistakes. There’s not some bar where I say, “I’ll hire you as a software engineer, but you can’t mess up ever.” We have to learn from every mistake we make. And ideally, if I’m doing my job as an engineering manager, I catch things before they become mistakes, right? Or I ask questions to try and get people to think about things where I’ve made a mistake before, and I’m trying to prevent them from making a mistake. The natural ebb and flow of things works in a way that oftentimes I’m too busy to look at every line of code that goes out, right? It’s not sustainable for me as a manager to be a shepherd of all the code that gets deployed to our servers. So you have tech leads, you have other people that you empower to have that kind of leadership and mentorship. And that’s how you kind of stay sane and learn from your problems. A lot of times it’s a really good exercise after something goes wrong to do a post-mortem. So you document what went wrong. Maybe the servers were down for five minutes, maybe this many users at that endpoint got a 500 or 400 error and they did or didn’t come back. You can quantify the downtime or the regression by how much money you may have lost. I couldn’t go into my game, I couldn’t spend my virtual currency, I couldn’t advance my castle to level 18. That’s a good way to kind of quantify what the effect was. It’s not to rub it in your face as a software developer, like, “Hey, we lost $2,000 because it was down for five minutes.” It’s a way to say, “Hey, there are consequences to this action. And we need you to be a software developer and push really hard to get new features out and new things for people to do on your website or whatever you’re doing, but we have to mitigate risk as best as possible.” And there’s only one way to really learn, and it’s by messing up. So that’s kind of my philosophy there.
Julie Gunderson: Absolutely. And the last thing you want is people to be afraid of that or they won’t innovate.
Erik Ketcham: Yeah. There’s a joke between myself and some of my peers: “If you haven’t had a post-mortem, you’re not trying hard enough.” But no one wants one, obviously, and I know a lot of developers who have avoided it for a very long time. Inevitably something minor happens, and then you can reclassify it as a post-mortem-worthy problem. But if there is no user effect, it’s hard to quantify, documenting it to [inaudible 00:21:05].
Julie Gunderson: Yeah. I recently had somebody tell me that they never have any incidents because they never go down.
Erik Ketcham: They’re not trying hard enough probably.
Julie Gunderson: Right. We have a couple of questions that we ask every guest on our show. And the first one is, what’s the one thing you wish you would’ve known sooner when it comes to running software in production? And I think we touched on this a little bit earlier.
Julie Gunderson: I like it. Okay. Well, is there anything about running software in production that you’re glad I did not ask you about?
Erik Ketcham: I think that there’s a lot of political reasons why companies choose different technologies. A lot of those things are costs for a small startup who’s got angel investing. Maybe you don’t have enough money to pay PagerDuty, or maybe you don’t have enough money to pay Datadog or Snowflake or all these other technologies that help you do all this monitoring. And thinking back to some of the bootstrapped startups that I’ve worked for that have failed, you have to make decisions as a business as well. I think I’ve been involved in some pretty scrappy endeavors in the past, and the right decision is always to move fast and break things when you’ve got a very small amount of money and a big idea that you’re trying to execute. And for me, I’ve made some poor decisions about the startups that I’ve joined that have not panned out at this point, but I’ve also made some really amazing connections with software developers in San Francisco and beyond, and met some amazing mentors and technologists throughout my 15 years in San Francisco. And that isn’t worth nothing, even though I haven’t been paid out by some kind of massive public offering from a company I started with. But yeah, I think that’s one thing: if you dug into my bad decision-making, I think that could have been a little bit embarrassing. But overall I think it’s still a positive, and I’ve met a lot of people and experienced a lot of amazing things.
Julie Gunderson: Thank you. And Erik, from my growing reading list, what’s one book you would recommend to our listeners that is a must read?
Erik Ketcham: So this is an old book that a lot of software developers have read already, but it’s called Extreme Programming. So this is basically the methodology of agile, in a way, as it applies to software development. And basically, it’s like a bootcamp for how to cope with constantly changing specs. So if you’re developing something and you have a sprint schedule, that’s every two weeks you do a workload, right? Inevitably your CEO is going to come to you and say, “I just had this meeting over lunch and we need to drop everything you’re doing and work on this other thing.” Normally people don’t context switch very easily. But the fact of the matter is that a business opportunity is something you have to strike on while the iron’s hot. And it’s okay to drop everything and move on, especially if you’ve done a lot of diligence with what you’re doing; you can come back to it later. So Extreme Programming is a good one. I think it’s by Kent Beck. I can’t remember for sure, but it came out maybe 20 years ago, I think. It’s a good survival guide for how to cope with constant change as a software developer.
Julie Gunderson: Thank you. And for everyone listening, we’ll put a link to that in the show notes and thank you, Erik, for taking the time to talk to us today, really appreciate that. For anybody that has any questions on SLAs or SLOs, you can visit the podcast episode with [inaudible 00:25:59] from Google on setting those. And again, Erik, thank you. So this is Julie Gunderson wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast, if you like what you’ve heard, you can find our show notes at pageittothelimit.com and you can reach us on Twitter at pageit2thelimit using the number two. That’s @pageit2thelimit, let us know what you think of this show. Thank you so much for joining us. And remember uneventful days are beautiful days.
In this episode Julie Gunderson talks with Erik Ketcham, Senior Manager, Software Engineering for a financial services company.
Erik talks about his high-level strategy for monitoring for folks that are new to the practice. He talks about things you may want to think about such as: How long it takes to load a webpage, breaking down website requests into different categories, measurements, and SLA’s.
Erik: “Another aspect might also be monitoring things like database contention. Maybe you’ve written some really amazing application code and it’s just hammering your database. So you have contention on your database, or you need to have read replicas or all kinds of other things to help cache your responses so you can have fast response times. But a lot of the things that I’m monitoring in general are going to be around web performance and kind of the full stack between someone making a request and getting a response back.”
Erik debunks the myth that when big companies started they had everything at their fingertips, when in reality startups start with very minimal footprints and are doing as much as they can from a single server.
Erik: “For instance, you know, we’d have a single server, we’d add a bunch of users. At some point we’d start getting to a point where your databases are overloaded, there’s too much contention, your app servers are falling over, you’re up all night and weekend kind of doing all that provisioning and scalability. But initially you think, oh, the startups just have instant scalability and they’re never going to fail, but the reality is that angel investors only give you a couple hundred thousand dollars.”
Erik chats about how to make things easier in startups and environments with limited resources, and choosing the right resources.
Erik: “Choosing the right technology and the right software and programming languages and scaling technology can really go a long way to reducing how much pain and anguish you go through while you are scaling your company.”
Erik goes on to share ways to invest in your company learnings, by investing in the right people and reinvesting profits back into the people. He talks to us about how everyone in a startup is operations, how teams work together, and how working together with the development teams leads to ultimate success.
Erik shares how he thinks about observability and what that looks like coast-to-coast. With a background in gaming, he discusses the need to have monitoring in place and how standard deviation and observability in your services are hard things to solve.
Erik: “It’s pretty important to have these tools, and I think that where we’re at currently, and where we’re headed in the future in terms of observability monitoring, with various tools that are out there right now that tie into PagerDuty, is allowing us to get much more sleep at night than we ever did before.”
Erik talks about how standard deviation is the main concept behind monitoring and how to differentiate between normal deviations because of expected work that occurs in the middle of the night.
Erik: “So you’ll end up saying, hey, there’s no traffic between 12 and one, but at one we run all these cron jobs and we do a lot of DB queries or something like that. Or, you know, try to purge records or do some kind of background task asynchronously, and so that’s a normal deviation, don’t worry about that. But then say, outside of those cron jobs, a hacker from New Zealand comes in and spams our APIs with a DDoS attack. We still want to be alerted for that: even though we expect it to be high, it’s way abnormally high from what we expected. So that’s the main concept behind monitoring on standard deviation: figuring out that it’s okay to have a huge spike in the middle of the night because we expect it, because it’s self-induced.”
Erik discusses how being an engineering manager changes the way he thinks, like using code freezes during certain windows, like Black Fridays, and building in robustness when you are planning your software deploys.
Erik: “You typically want to have a plan to either roll back your code changes or fix it forward, which is a really bad thing to do unless you have to do something like that. So imagine you deploy some really bad code and it was because you changed an if-then statement to have a greater-than versus a less-than symbol. So what do you do? Do you roll back your deploy, or do you change the greater-than or less-than symbol and deploy that as a fix forward? You have to make those calls all the time.”
Learning from mistakes is a major driver of a culture of accountability, Erik talks about empowering folks to learn from their mistakes and using postmortems to learn from failures.
Erik: “We have to mitigate risk as best as possible. And there’s only one way to really learn, and it’s by messing up”.
Additionally, Erik chats about how good software design and forethought go extremely far and how thinking through what you are supposed to do and communicating can go a long way.
Erik: “You’ve got speed, quality, or stability, you can have two but not three.”
He goes on to share how to think through those trade-offs and make business decisions with the least amount of risk.
I’m a software engineering manager from a non-traditional background who speaks enough human to be productive with the rest of the tech industry. I’ve traversed my career to my level of inadequacy, and look forward to moving even further down/up that path until I know absolutely nothing about what I’m doing.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.