Reliability of Cloud Dependencies with Jeff Martens

Transcript

Mandi Walls: Welcome to Page it to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @lnxchk on Twitter. Hi everyone. Welcome back to Page it to the Limit. With me this week, I’ve got Jeff from Metrist. Jeff, tell us about yourself and what you do.

Jeff Martens: Hey there. Thanks for having me. Yeah, as you said, my name is Jeff. I am the co-founder and CEO of a company called Metrist. But before I get into that, I’ll tell you a little bit more about me. I’ve been working in incident response and observability and developer tools for, gosh, 13 years now. I started in this industry at New Relic in product management, and then I was at PagerDuty for a little while also in product management. But over that time, I started noticing something that I thought was a problem that needed to be solved. And so my co-founder and I started Metrist and built an awesome team to go solve this problem. And that problem in a nutshell is that software is increasingly built on top of other software, and that means that the reliability of our software is heavily dependent on the reliability of our cloud dependencies. And in order to manage that properly, developers and anybody in an organization that’s responsible for an SLA, they deserve to have clear, timely metrics about the performance, the reliability, and the availability of the tools that they leverage to build and deliver software.

Mandi Walls: That’s a lot. That’s a lot going on there.

Jeff Martens: Yeah, it’s fun because in a way it’s a lot, but in a way it’s also very focused. Unlike a lot of other observability companies, we really focus on and care deeply about a very specific layer of the stack, and that is the cloud dependencies that we build on. So all the products from AWS, APIs from Stripe or EasyPost, or even developer tools like Circle CI or GitHub.

Mandi Walls: Oh wow, okay.

Jeff Martens: So it’s a lot, but yet it’s highly focused and I think that’s what makes it unique and fun for us.

Mandi Walls: That is cool. Are you also looking at third party JavaScript downloads and that kind of stuff too? All that weird stuff that comes in [inaudible 00:02:28]?

Jeff Martens: No. For us, we’re really just focused on the cloud dependencies that power our software. And so for us when we say cloud dependency, what we really mean is some other piece of software that’s delivered as a cloud or SaaS application. So we’re not necessarily looking at dependencies that are integrated into your app as a package, but rather external calls that are made to go do something.

Mandi Walls: Oh, okay. Cool. How’d you pick that? Is there a horror story behind that?

Jeff Martens: Well, really what it came from was just talking to a bunch of folks that care about observability and incident response. In my role as a product manager over the last 13 years, I had an opportunity to talk to a bunch of people and do it in a way that was really open and inquisitive. And I just started hearing people say like, “Hey, my customer-facing downtime is starting to come from a new place, and that is these things I build on.” When I stepped back and started looking at it, I don’t think building on these things is new. I would guess that somebody like a Twilio was really the pioneer in the space. Where Twilio said, “Hey, if you’re a developer and need to build in some kind of communications capabilities into your app, don’t worry. We got an API for you and let us manage all that stuff.” I mean, a very similar story to the cloud itself, but I think there was this whole thing in addition to the cloud, like the API economy as some people call it, where we can really just pay a small amount of money for a very specific functionality, whether that’s sending a text message, or printing out a shipping label, or charging a credit card, or managing user authorizations. That’s all just an API call away now. And while that’s been around for a long time, I think the problem really started bubbling to the top just in the last couple years as we started using more and more of these third parties. But also as our observability of our own software began to mature. As you start getting your own house in order, the next thing you start looking at is like, “Okay, well what else is impacting me?” And so a thing that was maybe the third or fourth top reason for customer-facing downtime became the first or the second top reason for customer-facing downtime. And so for us, it really wasn’t necessarily that we picked it, but we just kept hearing it over and over again and we said, “Okay, there’s a there there and we probably should go see if we can solve for it.”

Mandi Walls: That is super interesting. And it’s just this combination of all this increased complexity and the availability of so many cool things that you don’t have to write yourself, but also customer expectations have, I think, increased. There’s a lot more patience in some areas for things that are down or things that are wobbly, and a whole lot less in other places.

Jeff Martens: You bring up a really good point, and I think there are two different types of customer expectations. The first one is, once I experience something awesome in an app, I want it in every app, right?

Mandi Walls: Yeah.

Jeff Martens: So as a user, I want to be able to do all the things and I want it to be a beautiful experience that’s super easy for me and just feels like magic. We see a lot of that in the consumer app space and now we expect it in our business apps too. So now we have an expectation for any app that I use, whether it’s personal or work, whether it’s on my phone or on a website. I want to have the greatest capabilities with the best experience. And so, often the way to do that is instead of building it yourself, is to use a best-in-class third party to deliver that as a cloud dependency. So that’s the first thing. And then the second thing is, okay, now that I love this thing and I care about it so deeply, if it has any amount of downtime, I’m all of a sudden really frustrated. And so the user expectations have increased in a lot of ways, and that’s made it really complex, I think, for teams to deliver reliable software, because they no longer have to worry about just delivering the software that they wrote, they have to worry about how all the software they depend on is delivered as well.

Mandi Walls: Yeah, definitely. And sometimes you don’t know when you’ve picked something how good or bad it’s going to be. Maybe there’s some word of mouth, maybe there isn’t. Do you guys provide that too?

Jeff Martens: Yeah, it’s interesting. I think the main way we know if a tool that we’re about to integrate with and rely on, the main way we know if that’s reliable is looking at their status page. I think that’s what most people would turn to. And on the surface, that seems fine. Like, “Okay, I looked at their status page and they said they haven’t had any issues recently, or maybe there’s one in the last 90 days.” But we don’t think that tells the whole story. And that’s because typically status pages are going to report on an SLA, and an SLA is going to be a very specifically defined rule set. It says, “If we miss on this one thing for this period of time, for this many customers, we’ve missed the SLA and we’ll report on the status page.” Other times it’s just not feasible to be updating a status page for every little thing that happens. And so I don’t think we have good visibility into the reliability of these products before we even start using them. And so what do we do? We use things like what our friends think or what they say. And sometimes that’s really helpful, other times it’s not, because how they use a service or their reliability needs may not match yours. And then we think back to our perception of something. One of the examples I like to use is, if we look at AWS, I would say by far, AWS gets the most complaints and talk about being unreliable. But guess what? I think that’s completely misguided. I think AWS is one of the most reliable platforms that the internet has ever seen, but we talk about it a lot because we care about it a lot. So if we just were to go on perception, we might say, “Oh, well maybe this product isn’t that reliable.” But at Metrist we can put data behind that. So we are continuously testing and monitoring the functionality, the performance, and the availability of the most popular cloud products. Within that, for example, is about 15 different AWS products that we monitor from many different regions. 
And we empower our customers to monitor it from their own point of view. And we can bring all that data together and give you a really factual understanding of how reliable is this service? Does it meet my reliability needs? And how does it compare to the other services out there? And so I think that’s a much better way to evaluate a service before you sign a contract than just thinking back to what people talk about on Twitter or what your friends told you about their experience with something.

Mandi Walls: Yeah, definitely. Because like you said, the SLA might be overly broad, or it might be overly constrained to one thing that you may not even be using out of the product, and that’s what they’re guaranteeing on and has maybe nothing to do with your use case or the things that you care about. So that’s super interesting too.

Jeff Martens: Yeah, exactly. And not all outages impact every customer.

Mandi Walls: Absolutely.

Jeff Martens: Every user. So I know a lot of companies, understandably, they have a rule in place that says, “Well, we won’t update our status page unless a certain percentage of our customers are impacted.” Often that’s like 5% or more. But if it’s 4.9% of the customers impacted, that wouldn’t trigger a status page update. But you care deeply. You’re saying, “Hey, look, I still need to know this.” And so I think we have a little bit of a mismatch on what people expect from a status page, and what they actually need. And it’s almost like trying to put a square block into a round hole.

Mandi Walls: Definitely. Yeah, I mean it’s good that they’re there. It’s definitely an improvement I think over the last few years anyway, that there’s at least an understanding in the industry that we all care about this. One of the stats you sent me was that digital businesses are using more than a hundred SaaS products. And I feel that stress, I open my Okta, and it’s just like I don’t even know what half of this stuff is. And I am not a developer, so I know the engineering team has a whole other suite of crazy business that they’re working on, and it’s just this Cambrian explosion of products and services and potential, but also a whole lot of confusion and unknowns out there.

Jeff Martens: Absolutely. And when you look at that hundred plus set of products that a typical digital business uses, what we’ve found is that probably 30 to 40 of them are directly integrated into the product that that company delivers. It’s one thing if a SaaS tool you rely on goes down, like Slack goes down, that’s pretty disruptive to my workday, but my customers may not be impacted by it.

Mandi Walls: Yeah, definitely.

Jeff Martens: But if something like Stripe goes down, all of a sudden my business is directly impacted by this because my customers can’t purchase something from me. But the confusion becomes like, “Okay, I know there’s a problem, is it me, or is it them?” That’s the question that a lot of people are asking when they go into incident response. I think a lot of us, you and I and everyone listening knows that 10 to 20 minutes during incident response actually makes a huge difference.

Mandi Walls: Yeah, forever.

Jeff Martens: That’s about what we hear it takes people to either blame a third party or give innocence to a third party: it’s typically 10 to 15 minutes. And I think both of those things are powerful, blame or innocence. And it’s not to say blame like, oh, it’s their fault. I think it’s still our responsibility to deliver reliable software, but understanding where the problem might be coming from is going to help us out. But understanding if the problem’s not coming from a third party cloud dependency is also valuable. So how can I better use my incident response time? Well, if I go into incident response knowing the real-time status of all the dependencies my software has, I can then determine where to put my focus. Do I put it on my stuff, or do I put it on that third party? And I think either way, we’re driving down incident response times by five, 10, 20 minutes.

Mandi Walls: Yeah, that’s awesome. I mean, we own our own availability or whatever, our own reliability. It’s kind of a pat sort of in an ideal world, yes. But knowing all of those components and having a good practice and hygiene around, I have now added this facility into the system. I need to be responsible for making sure that everyone else knows that I’ve done that and here’s where I need to put it. And having some kind of practice around that documentation and that knowledge sharing is super important. I love finding hidden dependency, that just makes me so happy.

Jeff Martens: Well, I’m glad you said what you just did and the way you said it, because it really aligns with how we think about it. We consider this to be cloud dependency management and it’s management, because it’s not just innocence or blame. It’s not just incident response. There’s a whole lot that we can do to protect and ensure our own reliability when we build on these cloud dependencies. So the first thing we can do is find out about an issue as fast as possible. The next thing we can do is verify our suspicions. If we think it’s a third party, but we don’t know, how can we answer that question as quickly as possible? What kind of information can I give you to help you better respond to resolve that incident faster? All that’s incident response. But then there’s a whole next level of things. How do we use automation on this kind of insight to avoid incidents, or at least avoid impact from incidents? How can I use this level of intelligence to make decisions like pausing a deploy pipeline, or automatically failing over to a backup payment provider because I can’t risk not taking a payment at any given time. But then from there, how do we hold our vendors accountable and maybe more importantly, how do we work with our vendors to improve their reliability? I did an analysis for a company we were working with, and we took a look at what kind of savings that they could get from resolving their incidents 10 or 20 minutes faster. And it was a notable amount of money, but what was more interesting is if they could push one of their most critical cloud vendors to go from three nines to four nines, that actually has a bigger impact on their top line and bottom line than reducing incident response times. So how can we use this kind of data about the real experience you have with the cloud product to have a conversation with your vendors and to work with them to provide a service that continues to up-level in its reliability and meet your needs?
And then finally, going full circle, how do we work with the right vendors? How do we work with the vendors that meet and support our own reliability needs? Maybe performance is more important than availability, but how do you know those kind of things beforehand? Unfortunately, we tend to sign a contract and then discover it, and then we have to do a rip and replace to something else. But what if you could go into it with knowledge of, here’s all the players in this space. And I can go to Gartner to understand their features, and their pricing, and their roadmap, but where do I go to understand the reliability and the user experience that they delivered over the last 12 months or 24 months? And that’s where Metrist can come in and say, “Look, we can help you pick a better vendor so you don’t even have to think about as many incidents down the line as you maybe would if you mistakenly pick a vendor that doesn’t meet your reliability needs.”
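To put rough numbers on the "three nines to four nines" comparison above, here is a minimal sketch of the arithmetic. It is not from the episode: the function name `allowed_downtime_minutes` is made up for illustration, and it assumes a 365-day year and availability measured over the whole year.

```python
# Illustrative arithmetic only; names here are hypothetical, not from Metrist.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


three_nines = allowed_downtime_minutes(0.999)   # roughly 525.6 minutes per year
four_nines = allowed_downtime_minutes(0.9999)   # roughly 52.6 minutes per year

print(f"99.9%  allows about {three_nines:.1f} minutes of downtime per year")
print(f"99.99% allows about {four_nines:.1f} minutes of downtime per year")
```

Under these assumptions, a vendor moving from three nines to four nines gives back several hundred minutes a year, which is why it can outweigh shaving 10 or 20 minutes off individual incidents.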

Mandi Walls: No, that’s super powerful. And even after you have an established relationship with a vendor, those contracts come up for renewal. And it’s great to go in with actual data and not a feeling that in my gut, I feel like you’re letting us down and you’re not as reliable. But if I have actual data to be able to say, “Well, over the last eight months of our 12 month contract, you haven’t been performing, here’s what we’ve been seeing.” And actually have a conversation about it with that vendor seems very powerful.

Jeff Martens: Absolutely. And I think everybody wants to do right. I believe that cloud vendors, they want to do right by their customers, but we operate in these complex environments with thousands or tens of thousands of customers, and it gets really hard. If our product was experienced entirely as a website, there’s a lot of ways that I can get visibility into my user experience. But that’s just not how software is anymore. It’s less software, and more systems. And it’s APIs and it’s back ends and front ends and mobile apps and all these third parties that tie into it to do a very specific thing in a specific way. And it gets really hard to say like, “Did I deliver to this specific customer, this specific thing?” One of the reasons we created Metrist, to share a quick story is, when I was a product manager at a prior employer, I went to an executive briefing a few months ahead of a contract renewal. And these are pretty common things we do in SaaS for large customers. You sit down with a customer, you share your roadmap, you ask them what they need from you next year, and you’re really saying, “Look, if you re-sign our contract, we’re going to continue to be a great partner to you.” And in this one instance, we sat down in the customer’s office and the first thing they brought up was our reliability. And they said, “Hey, you haven’t done well in this last month in this last quarter.” And we looked at the customer and said, “What are you talking about? We had one of our best quarters ever for reliability.” And they essentially said, “Look, we don’t care what was on your status page. We don’t care what the experience was of all the other customers you have, we care about the experience we had with your product.” And that was a perfect example of an inability to communicate with each other. It wasn’t that one was right or wrong, or somebody was mad and somebody was defensive. It was, help us understand.
So myself and one of the representatives from our engineering department, we were pulling up all of our dashboards, we were filtering on everything we could to try to find some data that supported what this customer said, and we couldn’t find any of that data. We asked them to present it, and they had anecdotal data, but not actual data. And that was just such a missed opportunity for us to have a really honest conversation with each other to help each other get better. Sitting in that room that day, I thought to myself like, “This is nuts that we can’t have this conversation based on actual data and that we can’t agree. I think we want to agree, we just need the data to prove it.” And so that’s really what we set out to do at Metrist: to create the data that allows cloud vendors and their customers to have the same understanding of what the experience was with a product, whether that’s in real time for incident response, or trending over time as they manage their vendors, manage their customers, and ultimately grow their product and improve their reliability.

Mandi Walls: Awesome. I mean, it’s just going to get more and more important as things get more complex and users require or expect more features and want better performance out of absolutely everything that they touch, so.

Jeff Martens: And it just doesn’t make sense for us to try to build all of that ourselves. I talk to a lot of folks, and I think the ones that are the most mature in their engineering process are the ones that are saying, “Look, we’re going to build the things that are our core competency. We’re going to build the things that are our competitive differentiator. And anything else, we’re going to look to a best-in-class cloud vendor to help us meet that need.” That’s the smart way to approach it because it’s no longer a, should we offer this or not, your customers expect you to offer these experiences. You have to do it. What’s the best way for you to do it? And I think increasingly, we’re going to look to other companies to help us do that. One of our customers was telling me that a couple years ago, they had about 30 to 40 infrastructure API and SaaS dependencies in their organization. And today it’s about a hundred.

Mandi Walls: Ah, gosh.

Jeff Martens: So in a few years it’s about, it’s almost tripled.

Mandi Walls: Absolutely.

Jeff Martens: And I think that’s a story we’re going to continue to see for a long time.

Mandi Walls: Yeah, it’s a fascinating space, because there’s all these things and behind the hood, your average application may not be that different from the next one. So it becomes even more important for the actual creators, have something that’s super differentiating and not waste your time on all these things that are just now table stakes. Yes, you should have payments. Yes, you should have personalization. Yes, you should have X, Y, Z, but don’t build that yourself. It’s like building your own security or encryption software anymore. Get that somewhere else and spend all your time doing what makes your business interesting.

Jeff Martens: Absolutely. And I think if you look at certain industries, like if you look at e-commerce for example, there are so many things that can be done to optimize and improve the customer experience and sales, but those things can be massive. They’re companies in themselves. Like a recommendation engine, that’s not necessarily a trivial thing, that can be a really complex thing. And that’s not the only thing you have to do as an e-commerce company. You’ve got a dozen other things like that, that you need to put in your application to serve your users and to serve your business. And these are really complex things, but they’re no longer optional. They’re table stakes now.

Mandi Walls: Awesome. Well, this has all been super interesting. That’s fascinating. Definitely, as you hear as you watch little wobbles and rumbles and interesting things about some of these products, it’s definitely time to figure out for real what is going on with all these systems so that everybody else can build better stuff and have a better foundation to work from.

Jeff Martens: Absolutely. I mean, I think that’s what we all want, right? As product developers and as cloud and SaaS vendors, we want to deliver highly performant products that are super reliable. And the best way for us to do that is to have a clear understanding of what’s happening, where, to whom, and it’s going to benefit all of us in the long run.

Mandi Walls: Awesome. Yeah. Sweet. So one of the recurring questions we have on our show is to ask folks to debunk a myth. So if you’ve got a pet, especially the pet peeve ones, the ones that folks really always get wrong and you have to straighten them out, what’s something that folks really don’t get the right lens on for this space?

Jeff Martens: Oh gosh. There’s a number of things I think I could bring up, but I’ll start with this one. I talk to a lot of folks that say, “Well, I hear what you’re saying. My cloud dependencies really do matter, but I can’t do anything about it. So what’s the point?” So for me, that’s the first one that really I think is misguided, but not because people want to be negative about it, but I just don’t think we’ve ever thought through what can we do? And so to debunk that myth, there is something you can do when you have third party dependency issues. And it goes back to what I talked about earlier. The first thing we can do is we can pick vendors that better meet our reliability needs to reduce the risks of having outages that are caused by issues with our vendors. After that, we can work with them to better manage the service that they’re delivering to us by sharing data back to them about the experience that we’re having in real-time, all the time, so that they know how they can best support their customers and the apps that are built on top of them. Again, reducing risk. And then if we do get into a scenario where there is going to be potential impact, how can we use automation to reduce our chances of being impacted? So one of the things that we’ve noticed as we monitor and test all of these products is, in hindsight, we can often see warning signs where a service is highly reliable, hasn’t had any issues out of the ordinary for weeks or maybe even months, and then you start seeing some latency weirdness, and then some massive spikes in latency, and then an outage. So one of the things that we think people should be able to do is use that kind of data to put together some probability models. How likely am I to have an outage in the next one, two, three, five minutes? Can I take that information and use it to make decisions? Or maybe it’s not even necessarily a probability. Maybe there actually is something going on.
And you do something as simple as pausing your deploy pipeline just so you don’t introduce any unnecessary or additional risk if a thing you depend on is showing signs of not being its normal healthy self. And then I think about like, “Okay, so now an outage has hit you, what can you do about it?” And yes, there are times that there’s nothing you can do about it, but there’s times that you can. How can you fail over to maybe a backup provider instead of just waiting it out? But at the very least, maybe it’s as simple as saying, “Look, there’s nothing we can do about it, but I can let seven of the eight people on this incident response call go back to work and not worry about it.” So that’s the myth that I would debunk. There’s always something you can do about it when it comes to third party outages, it’s just that sometimes those things are proactive to help reduce the risk. And other times those things are reactive and they may only help you improve your efficiency, but there’s always something you can do about it.
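The "latency weirdness before an outage" idea can be sketched as a simple gate in front of a deploy pipeline. The sketch below is only an illustration under stated assumptions: it uses a plain z-score heuristic rather than a real probability model, it is not Metrist's method, and the name `should_pause_deploys` and the threshold of 3.0 are hypothetical choices.

```python
from statistics import mean, stdev


def should_pause_deploys(baseline_ms, recent_ms, z_threshold=3.0):
    """Return True when recent dependency latency sits far above its baseline.

    baseline_ms: latency samples (ms) from a period considered healthy.
    recent_ms:   latency samples (ms) from the last few minutes.
    z_threshold: hypothetical cutoff; how many baseline standard deviations
                 above the baseline mean counts as a warning sign.
    """
    if len(baseline_ms) < 2 or not recent_ms:
        return False  # not enough data to judge either way

    base_mean = mean(baseline_ms)
    base_std = stdev(baseline_ms)
    recent_mean = mean(recent_ms)

    if base_std == 0:
        # Perfectly flat baseline: any increase at all is treated as unusual.
        return recent_mean > base_mean

    z_score = (recent_mean - base_mean) / base_std
    return z_score >= z_threshold


healthy = [100, 102, 98, 101, 99]   # made-up baseline samples
spiking = [180, 220, 400]           # made-up samples during a latency spike

if should_pause_deploys(healthy, spiking):
    print("Dependency looks unhealthy: hold the deploy pipeline")
```

A gate like this only answers "is the dependency behaving unusually right now?"; deciding whether to pause deploys, fail over, or simply release people from the incident call remains a judgment for the team.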

Mandi Walls: With all this stuff, do you have recommendations for how a team picks which vendors to watch first? If I’m going to start this practice and I know I’ve got X number, some three-digit number of vendors, where do I start? How do I pick what to go for? Do you start with the ones that are the messiest, or the ones that are the calmest?

Jeff Martens: That’s a great question. So it’s first, understand what your dependencies are. Another thing that I find really interesting as I talk to folks is when I first meet them and I introduce what we do, and I ask them, well, how many different cloud dependencies do you have? And I usually get a number, and that number may be one or two dozen. And by the end of our first call, or maybe into our second or third conversation, that number has gone from being one or two dozen to being 40, 50 or even 60. And that number sounds crazy, and I had a tough time believing it myself, but it is now the common number that we hear 40, 50, or 60 things that our product depends on. And so the first thing that you can do I think, is just understand what are my dependencies, and where are they? The next thing I would do is look at, well, what’s the SLA that all of these dependencies offer? That’s important because if you say, just for simplicity’s sake, let’s just say you have an SLA of 99.9% uptime. Well, if you have a bunch of products that also offer 99.9% uptime, you might think, “Oh, I’m good. They have three nines. I have three nines.” But because those are all separate systems, that’s not actually how the math works. And so in order for you to have three nines yourself, you’re going to need to rely on products that probably have even better than three nines. But to understand what that number is, you have to look at what’s in the critical path. The truth of the matter is a lot of these 40, 50, or 60 dependencies we have, they’re not all in the critical path. And some of them are things that support a feature that you maybe don’t have a reliability target around. Other times, they’re things that you can very easily work around if they go down for 10, 20 minutes. But there’s a handful of things that are in that critical path.
And so look at the things that are in the critical path, understand the SLAs that they offer, and using some pretty simple math, we can start to understand like, “Okay, if I’m going to offer three nines, here’s all the things that I depend on and I need them to all offer whatever that math says based on where they are in the critical path and how many different dependencies you have.” So that’s how I would get started. It’s really about knowing where your risks are.
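The "pretty simple math" mentioned here can be sketched as follows. The sketch assumes that every dependency in the critical path must be up for a request to succeed, that their failures are independent, and that your own code adds no downtime of its own; real systems rarely satisfy all three, so treat the result as a rough estimate. The function names are hypothetical.

```python
import math


def composite_availability(availabilities):
    """Availability of a chain where every listed dependency must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(availabilities)


def required_per_dependency(target, count):
    """Availability each of `count` equal dependencies needs to hit `target`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return target ** (1.0 / count)


# Five critical-path dependencies, each offering three nines.
chain = [0.999] * 5
print(f"Composite availability: {composite_availability(chain):.5f}")

# What each of the five would need to offer for the chain to reach three nines.
needed = required_per_dependency(0.999, 5)
print(f"Each dependency needs about {needed:.5f}")
```

With five three-nines dependencies in the critical path, the composite comes out near 99.5%, and each dependency would need roughly 99.98% for the chain as a whole to reach three nines, which matches the point that your vendors need to be better than your own target.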

Mandi Walls: Yeah, definitely. They could be anywhere really. Awesome. Well, Jeff, thank you so much. Where can folks find you? Where can they find Metrist online to find out more?

Jeff Martens: Yeah, yeah. So we are at metrist.io, so M-E-T-R-I-S-T.io. We also have a weekly newsletter called What Went Down. So you can go to metrist.io and under the resources tab, there’s a link for What Went Down. You can subscribe to get a weekly email that highlights all the cloud outages that we know about, many of which never make the news and never make a status page. And it’s really interesting to see and follow those trends, because a lot more is happening than gets talked about on Twitter or gets posted on status pages.

Mandi Walls: Awesome. Fabulous. This has been great. This is super interesting. As I watch the engineering team pull in all kinds of things in various places that I’ve worked, I’m like, “Man, how do we know?” Right? Now we have an opportunity to know, so this is perfect. Awesome.

Jeff Martens: Yep. I mean, you wouldn’t operate your software without having proper visibility into all the microservices inside your org that you depend on.

Mandi Walls: Exactly.

Jeff Martens: So why would you operate software with third party dependencies without having that same level of visibility?

Mandi Walls: Absolutely. Perfect tagline. We’ll leave folks with that. Jeff, thank you so much for coming on the show.

Jeff Martens: It’s been my pleasure. Thank you for having me. This has been a lot of fun.

Mandi Walls: Excellent. So thanks everybody for listening to our show this week. We’ll be back in two weeks with another episode. In the meantime, we’ll wish you, and all of your vendors, an uneventful day. That does it for another installment of Page it to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at Page it to the Limit using the number two. Thank you so much for joining us. And remember, uneventful days are beautiful days.

Guests

Jeff Martens

Hosts

Mandi Walls (she/her)