Quintessence Anx: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Quintessence, or QuintessenceAnx on Twitter.
Quintessence Anx: Today, we’re going to be discussing understandability. Understandability, in this context, is how easy it is for an engineer to comprehend a given system. Among other things, this can have a huge impact on how much time and effort is required to diagnose and resolve service disruptions.
Quintessence Anx: Today, we are joined by our guest, Liran, the co-founder and CTO of RookOut. He’s an advocate of modern software methodologies like Agile, Lean, and DevOps. Liran’s passion is to understand how software actually works. Thank you for joining us today and welcome to the show.
Liran Haimovitch: Hi, Q. It’s great being here.
Quintessence Anx: Awesome. Can you tell us a little bit more about what software understandability is?
Liran Haimovitch: Working with our customers at RookOut, we came to realize that software engineers are always trying to understand, to comprehend, how the software is built and how it’s doing. In simple terms, it’s the age-old question for every software developer: what is my code doing, and why is it doing whatever it’s doing? As software systems evolve, they age. As more and more people get involved in them, those questions become harder. We’re finding that software engineers quite often spend more time trying to understand what is going on and why than they spend writing new features, developing, writing code, or even fixing bugs. They just spend so much time trying to comprehend what’s going on.
Liran Haimovitch: That’s what software understandability is all about. It’s about empowering engineers to understand the systems so they can just go ahead and do their jobs and deliver value for the users.
Quintessence Anx: That’s awesome. Have you encountered any myths or misconceptions while you’ve set out working in this area?
Liran Haimovitch: Working in this area, we’ve seen the misconception that complexity is synonymous with lack of understandability. If a system is complex, it’s going to be hard to understand, and if it’s simple, it’s going to be easy to understand. That’s so not true. There are many things you can do besides changing the complexity of the system to make it more understandable. That’s fortunate, because often, changing the complexity of the system is pretty hard.
Liran Haimovitch: I think the most obvious example of that is a script, say a Python script, that reads a file from disk and processes it. If I were to give you that script as a standalone, just, “Here’s the script, tell me what it’s doing,” that would be one thing. But if I were to give you the same script with input examples, output samples, some documentation and so on and so forth, everything would be so much easier to understand. Even though I haven’t changed the system or its complexity one bit, things are much easier to deal with.
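As a rough illustration of the script example above, here is a minimal sketch of what “the same script with input examples, output samples, and documentation” might look like. The function names, the three-field log-line format, and the sample data are all hypothetical, invented only for this illustration. The processing logic is the same with or without the docstrings; only the understandability changes.

```python
"""Count status codes in a request log (hypothetical example script)."""
from collections import Counter
from typing import Iterable


def count_status_codes(lines: Iterable[str]) -> Counter:
    """Return how many times each status code appears.

    Each line is assumed to look like "<method> <path> <status>".
    Lines that do not have exactly three fields are skipped.

    Example input:
        GET /home 200
        GET /missing 404
        POST /login 200

    Example output:
        Counter({"200": 2, "404": 1})
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        counts[parts[2]] += 1
    return counts


def count_status_codes_in_file(path: str) -> Counter:
    """Read the file at `path` from disk and count its status codes."""
    with open(path, encoding="utf-8") as handle:
        return count_status_codes(handle)
```

Someone handed only the loop body would have to guess the input format and the handling of bad lines; the documented version answers both questions up front.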
Quintessence Anx: I got you. What are some of the benefits of being able to understand, or understandability, as we’re discussing it?
Liran Haimovitch: First and foremost, whenever you’re developing a new feature, whenever you’re working on something new, you have to figure out how it combines with whatever you already have in place. Where does it go into the system? How can it impact its performance? How can it impact the stability? How can it impact the architecture? What potential drawbacks might it have, and so on and so forth?
Liran Haimovitch: To do it right, you have to understand the system you’re working on. The better you understand it, the better choices you’re going to make. If you make good choices, I think we all know that feeling when we’ve developed a system from scratch and we know every nook and cranny in it, and so we just make those instinctive calls, “This is where it should go,” and things are very easy. If you get those decisions right, things happen very fast. You develop the code faster. Quality is higher. Tech debt is lower and so on and so forth. But if you’re failing to understand the system, and if you are making mediocre or even bad decisions due to lack of understanding, then you’re going to be spending more time on feature development. You’re going to deliver lower quality features with more bugs because you have more edge cases you haven’t thought of. You’re going to be struggling with that. Many organizations are so worried about it that they spend a great deal of time planning and investigating, running so-called spikes, just to figure out what’s going on.
Quintessence Anx: I mean, that makes sense, right? Because we’ve all been running around in circles, as you just said, trying to figure out what happened, when what we need is a deeper understanding, more and more information, somehow, about what’s happening. Speaking of running around, how can you tell when it needs work, like where to improve and how?
Liran Haimovitch: If an engineer on the team gets a task and is not sure how to approach it, even though the task is well-defined from a product perspective, so the UX is well-defined, the desired behavior is well-defined, but the engineer is not sure how to approach it, where does it go into the system, what are the potential impacts of it, then your understandability probably needs work. If you are looking at a feature and you are not sure how to scope it, if you’re not sure what’s the potential impact or what are the risks of it, then all of that usually means that that engineer doesn’t understand the system well enough. If it’s just one engineer, especially if it’s someone you think that’s not a big hurdle, but if everybody on the team is struggling, or if most of the team is struggling, then the system is very hard to understand, and there are things you can and should be doing to improve that.
Quintessence Anx: Got it. That makes sense. When you’re improving it, how does that affect your MTTR and other metrics that you might be tracking?
Liran Haimovitch: The thing is, once PagerDuty goes off and you know something has gone wrong, obviously the first thing you want to know is what went wrong and why it went wrong. Maybe you are seeing an alert that latency is going up, or request rates are down, or whatever. But then, you have to figure out how it combines into the big picture. Maybe you have one alert going off. Maybe you have 20 alerts going off. We play this mind game where we envision the system, we picture where are the alerts going off, what parts of the system are not working properly. And we are trying to figure out, why is this happening, what do all those alerts have in common, how do they combine? Obviously if the system is very simple, say it’s a single server with a set of APIs, then it’s very easy to say, “The API is showing error rates because the database latency’s up,” and so on.
Liran Haimovitch: But as systems become more complex and contain more microservices, we need to build a more complex mental model. When you wake up at 2:00 AM because all of those alerts are going off, you want people to be able to quickly understand what’s going on, to identify the root cause, and to be able to remediate it as quickly as possible. Whereas if they’re struggling due to lack of understanding, that’s when things get messy. They have to wake up more people because each engineer knows a piece of the system, they have to look at more tools, they have to go deeper, and so on. The more you understand the system, the easier it becomes to figure it out.
Quintessence Anx: That makes sense. What are those key elements to ensuring a better understandability of your software and system?
Liran Haimovitch: We’ve already discussed complexity. Obviously, by reducing the complexity of the system, you can make it more understandable. However, that’s usually a very long process and something that doesn’t always have positive ROI. But there are many things you can do with existing systems to improve their understandability. Obviously, you can build the knowledge, you can collect knowledge from existing and former team members. You can build documentation, you can build procedures. You can create as much knowledge and tooling as possible and share it between team members, so that they are able to more easily understand what’s going on. You should create development environments, environments that replicate the production environments as closely as possible, while still allowing engineers to play around and experiment with them, so that they can take apart the system, put it back together, see how everything connects.
Liran Haimovitch: The other two important things you should be thinking of are, first and foremost, the observability tools. While observability tools are not meant for understandability, and they are not the ultimate solution for it, they do provide some insights into what’s going on in the system. You should make any observability tools you already have available to software engineers, even if they are not the traditional audience. Because they can and should be able to get some insights from that.
Liran Haimovitch: Last but not least, obviously debuggers. Debuggers are all about understanding our own code, running it in slow motion, playing around with it, experimenting, deep diving into any point of the business logic, and seeing what’s going on. That’s true both for traditional debuggers, such as those in our IDEs, as well as the next generation of debuggers or debugging platforms, such as RookOut, that allow you to debug your code as it’s running remotely, in the cloud, even in production. All of those tools together allow you to bridge the gap of understandability, even in rather complex and dynamic systems.
Quintessence Anx: That makes sense. It makes sense that there would be tooling to support. Liran, you mentioned about the tooling chain around software understandability that can make sure that everything works and that you actually are getting the information out of it that you need. Towards the end there, you mentioned about the debugging platforms. What would really distinguish, or what would increase understandability between what you’re terming a modern debugger versus maybe debuggers that have been either more commonly used or historically used?
Liran Haimovitch: Traditional debuggers have been used for local debugging. I mean, you git clone the repo to your own machine. Then you build it, you run it in a debugger and everything is happening locally. In a way, nobody really cares about the process you are running right now, so you can use breaking breakpoints. You can just stop it mid-run. You can reset the process as many times as you want. You can slow it down. So, it doesn’t provide any guarantees. It often slows the process significantly, sometimes even by hundreds of percent. You’ll get a very soft touch-and-go feeling, something that’s very easy as you tweak the code on your own machine and keep re-running it, but something that wouldn’t be acceptable running in a production environment and also something that’s not very easy to set up for remote debugging, or even for service mesh debugging, where multiple processes depend on each other.
Liran Haimovitch: The next generation or the modern debuggers work quite differently. They are using technologies that are much more similar to APM tools or exception management tools and are very production grade. So they use non-breaking breakpoints that get you the data you need without stopping the application and without slowing you down, without breaking the service mesh. They provide various guarantees for performance, stability, security, so that they allow you to get the data you need from the production environments without exposing sensitive data and without negatively impacting the user experience in any way. So you kind of get the experience you are used to, or almost the experience you are used to, without having the ops team or the security team all over it because you’re breaking the guarantees.
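To give a feel for the idea of a non-breaking breakpoint, here is a toy sketch in Python. It is not how RookOut or any other product is implemented; the class name, the redaction list, and the rate limit are all assumptions made for illustration. It only shows the three properties described above: the caller is never paused, the collection cost is capped, and sensitive values are not recorded.

```python
import copy
import time
from typing import Any, Dict, List

# Hypothetical list of variable names whose values must never be recorded.
SENSITIVE_KEYS = {"password", "token", "secret"}


class NonBreakingBreakpoint:
    """Collects variable snapshots without pausing the calling code."""

    def __init__(self, label: str, max_hits_per_second: float = 5.0) -> None:
        self.label = label
        self.min_interval = 1.0 / max_hits_per_second
        self.last_hit = float("-inf")
        self.snapshots: List[Dict[str, Any]] = []

    def hit(self, **variables: Any) -> None:
        """Record a snapshot if the rate limit allows; never raise."""
        now = time.monotonic()
        if now - self.last_hit < self.min_interval:
            return  # performance guard: drop hits that arrive too quickly
        self.last_hit = now
        try:
            captured = {
                name: "<redacted>" if name in SENSITIVE_KEYS else copy.deepcopy(value)
                for name, value in variables.items()
            }
            self.snapshots.append({"label": self.label, "variables": captured})
        except Exception:
            pass  # stability guard: a failed capture must not break the app
```

A call such as `bp.hit(user=user, total=total, token=token)` placed inside a request handler would record `user` and `total`, redact `token`, and return immediately, so the request continues as if the breakpoint were not there.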
Quintessence Anx: That makes a lot of sense, because we have SLAs we need to maintain and error budgets we need to stay within. So we want to make sure that we’re choosing things that are allowing us to stay within those confines that we set up for ourselves and what we promise for our different teams.
Quintessence Anx: So digging in to surrounding topics that are relevant to the space, so we have a lot of discussion around the observability piece as well, and DevOps and cultural changes as well as resilience engineering and those cultural changes. And these things are not necessarily overlapping, I realize. But when you’re talking about like, you have the culture that you need to set up and then you have the tooling that needs to support that culture, right, so debuggers as part of that. But what about those cultural practices? What about being able to have effective communication, or in the case of resilience engineering, being able to support the working pieces and keep everything… Again, build in more robustness into the system. How does switching the tooling for the debugger help with all of these different components that are moving around the human parts?
Liran Haimovitch: That’s a great question. The human parts have always been a huge part in software engineering. I think over the last years, maybe 15 or 20 years, we’re coming to realize how much the human part, the software engineers, the product managers, the team behind the software, makes a huge difference in culture and communication in everything we do. That’s important. In a way, it emphasizes the significance of understanding, because we need to understand each other and we need to understand the software we’re working on, and we need to share that knowledge. We need to communicate that understanding to each other so that we can all work together. In many ways, actually, those new toolings allow us to do it even better, because new tooling allows us to collect live data from production and share that live data so that we can all be aligned. Rather than imagining things or keeping the knowledge in our heads, it allows us to very clearly measure, whether it’s collecting variable values, getting tracing information, profiling information, metrics. And then we can all communicate together about those metrics, that information we’ve gathered. And so we can better understand our software, better understand our goals, better understand what we’re trying to achieve.
Liran Haimovitch: I think resilience is also a great topic, showing how hard we are working to understand our software. Because resilience engineering is all about, “Our system is so complex, our system is so dynamic. We don’t know what’s going to happen when it’s going to break, and we’ve put in place so many safeguards and failovers and all these cool features that are supposed to keep the system running.” And yet, as the system becomes complex, we are not sure we understand what’s going to happen. So we know that one piece is going to fall, might fall. And we want everything to fix itself when that piece falls off. And we’ve designed it to do so, but we don’t totally understand what’s going to happen until we actually test it out.
Liran Haimovitch: Actually today, Google had a pretty big outage. And that’s exactly what’s going on. Chaos engineering, in a way, is allowing us to test that. We inject failures locally, gradually, and see how everything is coming together. And those modern debuggers can also be a part of that process. Because again, they allow us to decide, on the fly, what we want to collect and how we want to collect it. Say we’re injecting a failure and want to add additional instrumentation on the fly, just before we inject the failure. We can do that. Say we injected the failure and things aren’t going as expected. We can inject additional instrumentation to see what’s actually going on, and improve our visibility into the system live, as we are learning it, as we are trying to experiment and see what’s going on.
Quintessence Anx: That makes sense. You mentioned about chaos engineering, too, which prompted me with one more question before we close out. When we’re talking about chaos engineering and being able to run those chaos experiments, you really do need a deep understanding of your software. I was wondering if you wanted to share any experience you’ve had about using debugging with chaos experiments to help resolve those faster.
Liran Haimovitch: Actually, we had a case with one of our customers and they were running various chaos engineering processes. They were Kubernetes-based. So they were killing off pods and nodes and so on, and seeing how the system is going to respond. Actually, as they were killing off, one of their… they had a node that failed to recover properly. There was an application that didn’t recover properly, and they actually spent days trying to investigate the issue. All of a sudden, there was some kind of a performance bottleneck that showed up in one of their applications. They kept investigating and they couldn’t figure out why. By actually using RookOut, they set a bunch of non-breaking breakpoints and they saw that they had hit some of the EFS, the Amazon EFS, bandwidth limits, and that bottlenecked their application. Once they had pinpointed that limit using RookOut, they could re-architect that specific part to avoid that limitation.
Quintessence Anx: Yeah, that makes a lot of sense. Thank you so much for sharing all of that with us and diving into this topic with us today. For everyone listening, if you want to learn more about any of these topics, please check out RookOut at their website, rookout.com.
Quintessence Anx: Before we head out, we have two questions that we like to ask every guest. Are you ready?
Liran Haimovitch: Sure, I’m here.
Quintessence Anx: What is one thing you wish you would’ve known sooner about software understandability?
Liran Haimovitch: I wish it would have been possible before. I’ve spent so many years of my career chasing bugs, just waiting for log lines, waiting for deployments. I kind of wish this would have been possible before, that I could just get the information I need without rebuilding, redeploying, restarting, upgrading. Just thinking of the nights I’ve spent watching a build and deployment progress not going.
Quintessence Anx: Oh yeah, and trying to determine what’s going on as it’s crawling across. No, that makes a lot of sense. What about anything you’re glad we did not ask you about?
Liran Haimovitch: I’m glad we didn’t dive in too much into RookOut. Obviously, I love the product. I founded the company, but it’s nice to speak about other stuff, whether it’s understandability in general or chaos engineering and resilience. It’s a nice change of pace, not just talking about myself all the time.
Quintessence Anx: Well, we appreciate you, and the knowledge you bring. Thank you so much for joining us today, Liran.
Liran Haimovitch: Thanks for having me.
Quintessence Anx: Absolutely. This is Quintessence wishing you an uneventful day.
Quintessence Anx: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter @pageit2thelimit, using the number two. That’s pageit2thelimit with the number two. Let us know what you think of the show. Thank you so much for joining us. Remember, uneventful days are beautiful days.
Liran is the Co-Founder and CTO of Rookout. He’s an advocate of modern software methodologies like agile, lean and devops. Liran’s passion is to understand how software actually works. When he’s not thinking of code, he’s usually diving or hiking.
Quintessence is a Developer/DevOps Advocate at PagerDuty, where she brings over a decade of experience breaking and fixing things in IT. At PagerDuty, she uses her cloud engineering background to focus on the cultural transformation, tooling, and best practices for DevOps. Outside of work, she mentors underrepresented groups to help them start sustainable careers in technology. She also has a cat and an aquarium with two maroon clown fish and a mantis shrimp, of The Oatmeal fame.