An Exegesis on HA and DR with Rich Lafferty

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve the system reliability and the lives of the people supporting their system. I’m your host, Mandi Walls. Find me @LNXCHK on Twitter. All right. Welcome to the show. This week we have with us Rich Lafferty, he is a staff SRE here at PagerDuty. Rich, welcome to the show.

Rich Lafferty: Thank you. I’m glad to be here. It’s always fun to have insiders on.

Mandi Walls: Yeah, it is. We’ve had a couple on recently, that’s been a nice addition to our stable. So, tell us a bit about what you do at PagerDuty? What does a staff SRE do?

Rich Lafferty: What does a staff SRE do? I mean, there’s kind of two parts to it. So, yes, I’m a staff SRE here. I’ve been at PagerDuty for about four years and I’ve been a staff engineer for about a year. And so, you can kind of break it down into two parts. On the SRE side, our SRE teams here are essentially responsible for the platform that PagerDuty runs on. We have a lot of engineers, we are not a small place, and what we want to make sure is that the product teams here can focus on delivering product features and availability and performance, all those non-functional requirements, to customers as easily as possible. And so, the SRE teams here are basically responsible for sort of setting up a golden path to make sure that that happens. And part of that’s the platform. Right now, we’re running on HashiCorp’s Nomad, with some plans to move to Kubernetes in the next little while, just to kind of take advantage of the extended ecosystem there. But still, the important part is the abstractions. We want to make sure the developers don’t have to think about the low levels so they can think about the stuff that’s the highest leverage for our product teams to do. One of the things that PagerDuty does, of course, is that teams that write software own their software through the whole development life cycle, including operating it. But the more we can kind of shift that away into the abstraction so they don’t have to worry about hosts or networking or all that stuff, the better. So, that’s basically what the SRE org does here. And until very recently, I was an individual contributor on one of those teams working on that platform. Since I moved into the staff role, though, I’ve started thinking a little bit more organizationally and a lot less… I joke that my editor of choice these days is Confluence.

So yeah, a lot less kind of doing the work and a lot more… The main thing I’ve been thinking about and working on is how we do organizational change and culture change in the area of reliability, really emphasizing the R in SRE. And so, there’s been a bunch of stuff on that. The thing that I’ve been working on most recently has been our incident review process, our postmortem process, but that’s not exactly the thing we’re talking about. The thing I was doing before that was making some clarifications here about how we do high availability and disaster recovery.

Mandi Walls: Yes. So, for folks listening at home, Rich sent out an email to all of us titled “An Exegesis on HA and DR”. And it caught everyone’s eye, because it’s not every day that you get a work email that uses the word “exegesis”.

Rich Lafferty: That’s true.

Mandi Walls: So we were really interested in what exactly was going on here. So, yeah. Rethinking HA and DR, tell the folks out there, what are you working on there?

Rich Lafferty: Yeah. So, I mean, the reason it was an exegesis, and I got that word from Larry Wall, who… actually, I guess from Larry Wall and Damian Conway from the Perl community. Years and years ago, when Perl 6 was first launched, that’s how long ago we’re talking, Larry Wall wrote some apocalypses, i.e. explanations of what was happening in Perl 6, and then Damian Conway followed those up with exegeses, i.e. further explanations of, well, what the heck was Larry talking about? And so, a while ago we published an ADR, an architecture decision record, here about some changes to how we are going to do high availability and disaster recovery. The big change: long ago, PagerDuty basically ran active-active, multi-region. We were in two AWS regions and we were also in Azure, and all of the regions handled traffic at the same time. And the idea was that, if we were to lose a region, then the other regions would kind of continue operating. And so, the failure domain at that point, if you want to think about it that way, was that we could have a region failure and not notice. Of course, that’s really hard to do. And the reality was quite different from what we had designed, because it’s really hard to test failing a whole region. But also, we were spending a lot of time kind of setting things up for the possibility of a region failure. And what actually ended up happening, when you start looking through the postmortems and so forth, is… regions don’t fail very often. I mean, knock on wood, saying this now, but regions don’t fail very often. And a lot of other things of course do fail, because we’re running complex systems, and complex systems have emergent behavior and all those wonderful things.

And so, we were spending a lot of engineering time and cognitive time and so on making sure that this active-active region thing worked. So, a while ago we decided that doesn’t make a ton of sense. We’re going to reframe this a little bit. So, we published this architecture decision record a year and a half ago. It basically said, “Hey, we’re going to stop doing this.” Oh, the other thing I should mention in there: we were also doing all of our replication synchronously, which you have to do if you’re active-active.

Mandi Walls: Oh, yeah, because you need all the data in all the places all the time.

Rich Lafferty: You need the data in all the places all the time. And for a lot of things, we couldn’t have failures and so forth. We didn’t want eventual consistency in the middle of one of our customers’ incidents. They wanted to see things up to the moment. So, there was a lot of overhead. And so, we kind of said, “Okay, we’re not going to do that anymore. Instead, we’re going to focus on running out of one region at a time, and a second region is going to be there for disaster recovery. We’re not going to do synchronous replication across the WAN, between regions, anymore,” and so forth. And so, we published this. And the challenge was that over time people have left PagerDuty, new people have arrived and so on. And if you weren’t here for what was there before, if you weren’t here when we were running [inaudible 00:06:08] replication across the WAN, and if you weren’t here when we were running WAN-spanning Cassandra clusters, which I can’t say I recommend, then you didn’t really have the history of why we are doing it this way. And so, it was a good opportunity to kind of reinvest in the ADR and say, “All right, there’s a bunch of context that didn’t get captured in here, and I’m going to explain this from first principles: what do you need to know about how we think about HA and DR here?” The big thing that people don’t always get is that high availability, HA, and disaster recovery, DR, are different processes.

Mandi Walls: Absolutely. Yes.

Rich Lafferty: Right. The big thing that came away from that was to really clarify in people’s minds that, fundamentally, you’re going to use one a lot more than the other. I guess we should probably take a step back and explain to everybody what the heck I’m talking about when I say they’re different. So, you’ve got these two things and they’re both about being available all the time. Something terrible has happened and we want to stay up, especially for a company like PagerDuty, where the challenge is being up when our customers are not, and doing so on the same clouds the customers run on. We’re on AWS. And so, when our customers have problems in AWS, we also have problems in AWS, and we can’t move to another cloud to fix that. Or if we did, I’d be saying the exact same thing, but I’d be saying GCP. Fundamentally, you’ve got these two things. One is high availability, and that is regular operations. High availability is how you essentially meet your SLOs, so to speak. It’s like, how do you engineer software? How do you engineer a distributed system? Because we’re talking about distributed systems here. This is really [crosstalk 00:07:42]. Yeah. How do you engineer a distributed system to be available [inaudible 00:07:49] all of the time, for whatever value of “all” you’ve agreed with [crosstalk 00:07:54].

Mandi Walls: However many nines you like, yes.

Rich Lafferty: Exactly. So, that’s it. High availability is what’s happening all the time. There’s the thing about complex systems from Richard Cook’s paper, How Complex Systems Fail: complex systems run in degraded mode. They’re always failing. They’re full of latent failures. And so, the thing there is that sometimes those latent failures are visible, and you want to minimize their visibility because that’s what we call an incident. So, high availability is applying systems engineering techniques to these complex systems, these distributed systems that we build, so that they remain available even though components of the system might not be. Disaster recovery, on the other hand, is, well, I like to say, and it’s slightly tongue in cheek but there’s also some truth in it: disaster recovery is the thing you do so that when something extremely bad happens, you don’t have to fold up the company and go home. And disaster recovery, of course, is not just a technical thing. The folks over in HR are doing disaster recovery exercises. The folks in facilities are doing disaster recovery exercises. And ideally, all of these things are coordinated somehow. But really, what you’re looking at with disaster recovery is that you’ve decided that an event could happen that is going to be outside the risk that you’ve taken on in your normal operations, and you want to be able to continue to operate the business after you recover from that. So, the main difference here is that high availability is the stuff that’s happening all the time. It’s the stuff that you engineer in. And disaster recovery is there to handle the black swan event.
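
To make the “however many nines” remark concrete, here is a minimal Python sketch that converts an availability target into a downtime budget. The function name and the 30-day window are assumptions chosen for illustration; nothing here describes PagerDuty’s actual SLOs.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(nines: int, period_days: int = 30) -> float:
    """Minutes of unavailability allowed by an SLO of `nines` nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    unavailability = 10.0 ** (-nines)
    return period_days * MINUTES_PER_DAY * unavailability


if __name__ == "__main__":
    for nines in (2, 3, 4, 5):
        budget = downtime_budget_minutes(nines)
        print(f"{nines} nines: {budget:.1f} minutes per 30 days")
```

Three nines over a 30-day window leaves roughly 43 minutes of budget, which is why everyday failures have to be absorbed by the design rather than handled by a recovery plan.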

Mandi Walls: Yeah. And I think a lot of folks, as they’ve been looking at being in the cloud versus being in your own data center… back in the day when you were in your own data center, you had one place in one location. And maybe you were in Loudoun County and you were in San Jose, or you were in [inaudible 00:09:41] or whatever your case might be. And moving into the cloud, and especially the way AWS is now architected, it’s not that model anymore. Those availability zones, that region, isn’t one building. So, you get to rethink what the blast radius of your failure is.

Rich Lafferty: Yeah. Blast radius. It really is blast radius or failure domain. And so, I mean, fundamentally you still can’t get away from physics.

Mandi Walls: The speed of light is going to hang you up every time.

Rich Lafferty: Yeah. And there are still buildings and there are still servers, even though they’re really complicated and cool servers with Nitro controllers and stuff like that, that we never had when we were in the data center. But yeah, really understanding that, for instance, if you’re running out of a particular region, let’s say us-west-2, which is in the Pacific Northwest, there are particular threats that the Pacific Northwest gets, like earthquakes and forest fires. And even though this is the cloud, inside that region are a bunch of availability zones, and inside those availability zones are a bunch of physical buildings. And if you Google around, you can kind of figure out where those physical locations are, even though AWS doesn’t want to let you know. And so, you still need to handle the case that something, an actual disaster in the most conventional sense of the word, could occur that will make that area unrecoverable, or at least unrecoverable on the time scale that you’re interested in.

Mandi Walls: Yeah, absolutely.

Rich Lafferty: Yeah. And so, when you think about it that way, AWS has a bunch of guidelines for thinking about the abstraction of the availability zone and the abstraction of a region. Regions are far apart, and you mentioned the speed of light, so there’s a certain latency between them. So, for regular operations, there are definitely some advantages to running within a region, because you don’t have to wait for those photons to cross the United States or Europe or the Atlantic or anything like that. The advantage, I guess, of thinking about high availability and disaster recovery as different problems, different but related problems, is that with one you get to optimize for normal operations, and with the other you get to optimize for that black swan event, and you don’t have to account for one in the other. And I think that was probably what got in our way with the old way of doing things at PagerDuty, when we were doing synchronous replication between regions, with all those latency hits and so forth, and the complexity is the way of [inaudible 00:12:06] with them a little bit. It seems like you can get an economy of scale, that you can get something for free, because… why don’t we design one architecture that can handle high availability and disaster recovery? And then that basically makes disaster recovery a high availability function, and we don’t have to think about having a disaster recovery plan anymore. And that is a way of optimizing it. That’s something that you could decide to do, but there are costs that come with it. And of course the biggest cost as an engineering organization grows is always cognitive load. 
Fundamentally, the thing about these complex systems, these distributed systems we’re designing is that they’re sociotechnical systems that contain software and people and the computer, I mean, there’ll be bugs and stuff like that, but it’s a lot harder to program the people sometimes.

Mandi Walls: [inaudible 00:12:58]

Rich Lafferty: So, yeah. So, the trade offs that you make and everything’s always trade offs. So, the trade offs that you make, when you decide to make your HA, your high availability, and your disaster recovery design really tightly coupled is that, one is a lot harder than the other. And, they have different needs and different requirements and so forth. And, you’re taking on the cognitive load of doing both at the same time. So yeah, when we kind of made that transition and we thought about, well, high availability is what our customers are depending on us every single day.

Mandi Walls: All day. Yeah.

Rich Lafferty: And disaster recovery is for this Black Swan event that in all reality will probably never happen, but we still need to have some insurance against. Then, you can use different techniques on both. And you can focus on the specific requirements of each. That for instance, in the high availability sphere, a failure needs to be essentially invisible to customers where with a disaster recovery event, then maybe making that visible to customers is fine. And, those trade offs, and it’s not trade offs that an individual engineer or even necessarily the engineering org can make.

Mandi Walls: Well, no. I mean, you probably want to talk to your finance department about what your insurance will cover and what you’re required to do for your business insurance, and all those other things come into it as well. It’s not just, okay, well, it’s easier for us to replicate in one place or the other.

Rich Lafferty: Exactly. And I mean, also this all ties into sort of senior business leadership, if that’s going up to your VP and your CTO and your CEO and so forth going all right. So here, we need to make some trade offs and they’re big trade offs. In what direction do we want to take this?

Mandi Walls: Yeah. And it’s kind of funny, because with some of this stuff you can get really dark with your DR planning, with climate change being a thing that we sort of know about. Are your East Coast data centers going to be partially underwater? Sandy wasn’t expected, but those of us who had data centers in downtown Manhattan know that you can only take so much diesel fuel up the stairs when your data center is on the 20th floor. And data center alley in Loudoun County gets tornadoes from time to time. I mean, all that stuff is out there. So, yeah. It’s a really hard decision to figure out where to put your resources and to apply the money and the time to figure out what needs to happen next.

Rich Lafferty: That’s exactly it. And it’s easy to focus on natural disasters, because when you think about the word disaster, that’s kind of the first thing that comes to mind. And you have to think about natural disasters as well, especially since we do have a tendency to put data centers in places that are maybe not ideal: you put them in the Bay Area, or you put them in the Pacific Northwest, or you put them… again, I mean, where the hurricanes hit on the East Coast. I recently moved from Toronto to Nova Scotia, and so I’m in the way of hurricanes now, and kind of getting used to that. But there’s a lot of other things that could happen that would threaten the continuity of the business. And I guess the phrase that goes along with disaster recovery is often business continuity. That’s kind of the key to the difference between those two things. Really, if you think about it, high availability is primarily to serve your customers. Disaster recovery and business continuity is serving your customers, but it’s also serving the continuity of the business. And the customers, of course, are relying on the business continuing, but so are a lot of other people.

Mandi Walls: Yeah. All your stakeholders, all your shareholders, all of those folks.

Rich Lafferty: Exactly. Yeah.

Mandi Walls: Yeah, absolutely. So, as you’re playing with these things and you’re making the golden path idea, so kind of thinking about making the easiest way to do things the best way, what kind of tools and day to day guidance do you give the rest of the engineering team for sort of applying these in their regular practice?

Rich Lafferty: Right. So, it’s funny, if you want to kind of dive down into the tech, one of the reasons that we’re in the process of migrating from the HashiCorp ecosystem, Nomad and Consul and so forth, to Kubernetes is just that there is a huge ecosystem around it. And especially on the high availability part, there’s a bunch of just kind of standard practices, like using circuit breakers and rate limiting and load shedding, that are just something that you need. I mean, they’re not easy to implement and they’re not easy to get right, but they’re still a thing that you can implement. And what we don’t want is for every team here to build those things into their software by hand. If you’re using whichever HTTP library you may be using and it’s got a bunch of things for retry and exponential backoff and so on, that’s all great. But then the problem you have, as you scale up to hundreds of people in your engineering org and dozens of microservices and so forth, is that you start losing consistency. And a lot of people are thinking about the details of this stuff when they could be thinking about things that are more directly relevant to customers, if we could abstract it away. So, that’s where, let’s say, a service mesh comes in. There’s kind of two ways to do it: you can do it in a library, or you can do it in infrastructure. If you introduce a service mesh and you have these basically consistent proxies between all of your services, then that’s a place where you can globally introduce a lot of those controls, a lot of those techniques for making sure that things stay available, and you can ship with some good defaults. Shipping with good defaults is a huge piece of reducing cognitive load. And then, when service teams are implementing a new service, since there’s a service mesh between everything, they only need to tap into it. 
Then, they get a bunch of stuff for free. The other way to do it, of course, if you aren’t in a position where you can just roll out a service mesh, and we’re not yet, we’re not quite there yet, is that you can do it in libraries. And some of this is policy and some of this is technology. You say, if you’re going to be using HTTP connections or gRPC connections or whatever, you may be using Thrift connections, God forbid, then you need to use this internal library, and that library is a wrapper for all of these things that I’ve been talking about. But then they happen inside the application. Of course, the disadvantage of doing things with libraries is that all of those teams that I mentioned and all of those microservices need to track a certain version of a library, and when we need to roll out a change, a bunch of teams need to rebuild their software and so forth. And if you do it in a service mesh, the advantage is that you get all of those changes right away. And the disadvantage is that you get all of those changes right away.
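
As a rough illustration of the “do it in a library” option, the Python sketch below wraps any callable with retries, exponential backoff with jitter, and a very simple circuit breaker. Every name in it (`CircuitBreaker`, `call_with_retry`, the thresholds and delays) is hypothetical and chosen for the example; it is not PagerDuty’s internal library, and a production wrapper or a service mesh would do considerably more.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    """Counts consecutive failures and rejects calls after a threshold."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before retrying
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, allow trial calls again (a simple half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retry(func, breaker, max_attempts=4, base_delay=0.1,
                    max_delay=2.0, sleep=time.sleep):
    """Call func(), retrying failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; rejecting call")
        try:
            result = func()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            # Double the delay ceiling each attempt, then pick a random wait
            # below it so many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
        else:
            breaker.record_success()
            return result
```

A team would share one breaker per downstream dependency, for example `call_with_retry(fetch_schedule, schedule_breaker)`, where both names are placeholders. The trade-off Rich describes shows up directly: changing `failure_threshold` here means every consuming service has to pick up a new library version.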

Mandi Walls: So, how does that play for the sort of service ownership model that we have? How much knowledge do our application engineers need to have of the underlying components and how much sort of falls to your team to take care of on a day to day basis?

Rich Lafferty: It varies. One of the things you want to watch is variability. I mean, I mentioned that cognitive load is an enemy… And this applies to a lot more than just high availability. This is just for sort of building engineering teams. So, cognitive load is one, and the second thing is variability, where a lot of [inaudible 00:19:59]. You’ll look at your organization and you’ll see that a bunch of teams are doing well and a bunch of other teams aren’t doing well. And then you kind of dig in and you go, “Oh, those teams are doing well because they happen to have these particular engineers that have a bunch of experience in this.” What you really can’t do is count on individuals doing the lifting. You can in a 10-person engineering org; you can absolutely count on individuals. In a 50-person engineering org, I mean, you need to formalize a little bit more, but you can still make sure that the right people are working on the right things. When you get into the hundreds, into the thousands and bigger, then you can’t necessarily rely on that anymore. And so, there’s actually a great example here. We rolled out Terraform a few years ago. Before we had Terraform, engineers were poking around in the AWS console, which is like everybody. That’s where you have to start, right? Everyone starts there. And then, well, this isn’t very reproducible. And we don’t [inaudible 00:20:52]. And we also have some compliance requirements… We depend on peer reviews as an approval, rather than having a change board. Peer reviews are one of our compliance controls for things. So, how can we make sure that infrastructure changes are reproducible, have a history, and have compliance controls attached to them? And, well, Terraform’s an obvious… well, infrastructure as code generally is an obvious way to do that. You can use CloudFormation, you can use [inaudible 00:21:17]. We use Terraform and it was great. 
We rolled out Terraform. We wrote a bunch of individual modules… So, here is the PagerDuty way to use S3. Here’s the PagerDuty way to use auto scaling groups, and so on. And so, not only were people doing infrastructure as code, but they also were able to get a whole bunch of defaults in place. This is the way you have to encrypt S3 buckets. This is the way that provisioning an instance works. So, plug that into your auto scaling group and things will automatically join clusters and stuff like that. But we actually introduced a new problem. And the new problem we introduced is that suddenly all of the engineers at PagerDuty had to know Terraform. On one hand, cognitive load went down a little bit, because there was a history of infrastructure changes and people didn’t have to poke around the console to figure out what was going on. On the other hand, it went up, because now there was a new technology, and admittedly Terraform has some sharp edges, that everyone had to understand. The kinds of things that people run into with infrastructure as code all happen with Terraform and state management. There’s always the problem of, well, I need to remove one thing and I accidentally removed another thing. They’re really easy mistakes to make. So, the next step in terms of the golden path here is to make it so that we can have all of those benefits without having hundreds of people all having to learn Terraform. And so, the way that we’re doing that basically is to increase the level of abstraction. So, it used to be that there was really no abstraction on AWS. And if you wanted an S3 bucket, you went in and you clicked the [inaudible 00:22:46] to get an S3 bucket, and maybe there was a wiki page that told you how you should configure it. Maybe there wasn’t. So then we moved to Terraform. You could use this module and so forth. We’re actually moving to abstracting that away to… I just need an S3 bucket, please. Just give me an S3 bucket. 
And how that’s actually going to look, I’m not really sure yet. I know that we’ve got a team that’s been playing with Backstage, but somehow or other, there will be something that [inaudible 00:23:09], push button, receive S3 bucket. And then under the hood, we have all the things that we need to implement that in a repeatable and auditable and so forth way. But let’s make it so that we don’t have to write Terraform code anymore. I mean, I say push button, but it’ll probably be something more like a high-level configuration file that says, “Give me a bucket named this with replication.” And then you don’t need to worry about the details of Terraform and of providers and of state management, because something takes care of all that for you. And so, you just lift that abstraction up even further. And then, if a team does need something special, because there’s always a team that needs to kind of go outside the golden path… there aren’t fences, it’s just a golden path. Then they can drop down a level and do some Terraform. We’ll recommend against it. We’ll still support it. But it’s just about getting that abstraction level right so that they don’t have to worry about those details.
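
The “give me a bucket named this with replication” idea can be sketched as a small request that a platform layer expands using organization-wide defaults. The Python below is purely illustrative: `BucketRequest`, `resolve_bucket`, and the default values are assumptions made for the example, not PagerDuty’s tooling, and a real implementation would hand the resolved spec to Terraform or a similar backend.

```python
import re
from dataclasses import dataclass

# S3 bucket names are 3 to 63 characters of lowercase letters, digits,
# dots, and hyphens, starting and ending with a letter or digit.
BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

# Hypothetical platform defaults that service teams never have to think about.
PLATFORM_DEFAULTS = {
    "encryption": "AES256",
    "versioning": True,  # S3 replication requires versioning
    "public_access": False,
}


@dataclass(frozen=True)
class BucketRequest:
    """Everything a service team writes: a name and whether to replicate."""
    name: str
    replication: bool = False


def resolve_bucket(request: BucketRequest, team: str) -> dict:
    """Expand a high-level request into a full spec with defaults applied."""
    if not BUCKET_NAME.match(request.name):
        raise ValueError(f"invalid bucket name: {request.name!r}")
    spec = {"name": request.name, "owner": team, **PLATFORM_DEFAULTS}
    spec["replication"] = request.replication
    return spec


if __name__ == "__main__":
    request = BucketRequest(name="incident-logs", replication=True)
    print(resolve_bucket(request, team="example-team"))
```

The service team’s input stays two fields long, while encryption, versioning, and access rules live in one place that the platform team can change and audit.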

Mandi Walls: Yeah, absolutely. And it’s one of the things that we always try to impart to people: you want to focus your resources on the things that are your core competencies. You’re building a product for your users. We don’t want you to spend a lot of time trying to figure out, on an everyday basis for every single team, the specialized components of these S3 settings or, God forbid, IAM, or any of those crazy things.

Rich Lafferty: God for [inaudible 00:24:25]. Yes.

Mandi Walls: And I’ve been through IAM training and I’m still like, whatever. Okay, fine. But, yeah. Focus your time on building better experiences for your customers and spend less time dealing with the minutiae of all of this stuff. What I’m assuming this is going to do for your engineers, too, is that you’re taking away this hands-on-keyboard experience with a product, the product being AWS, but giving them more of a conceptual idea of here’s why you’re going to use a cloud provider for these things, so that if we do have to move those lower-level pieces, the abstraction is still there, and they can still say, “I need an object store. I need a scaling group.”

Rich Lafferty: That’s exactly it. And AWS calls that thing undifferentiated heavy lifting. And I think it’s a great metaphor. Undifferentiated, meaning you’re not going to get a competitive advantage out of doing it. I mean, heavy lifting is just… it’s work. But the undifferentiated part is really the important part. Another thing that we have to make sure the teams have is exactly what you said, a menu of things. We call it our platform standards list. If you want a database, there you go, database. Well, then it’s like, you can have any color as long as it’s black. You can have any database as long as it’s MySQL. And that’s backed by having a bunch of MySQL experts and making it really easy to do that, with, of course, exceptions coded into the policy, into the process, to allow teams to say, “Hey, we’re doing something different,” write up a design doc, and propose, “We need to use something else for this.” And sometimes the argument works. Sometimes the argument doesn’t work, but there’s at least that amount of flexibility. And it’s definitely been a challenge here as we’ve grown, and we’ve grown fast, and a lot of companies like PagerDuty are also growing very fast. I can’t remember the name of the triangle, but there’s a triangle which has autonomy and mastery and purpose. These are the things that people value in their job day to day. And I find that engineering often overemphasizes the autonomy part, that it’s very important for teams to be able to choose their own thing. As you grow, you need to kind of balance away from that absolute autonomy thing towards the efficiencies of, basically, economies of scale. Finding that balance is always a challenge. But again, just to kind of tie this back into the high availability side of things. 
The more you can kind of make resources available to teams, I mean, make just building blocks that you can plug together to get this for free, the less time teams have to think about building, say, I don’t know, a durable queue, and the more time they can spend building a really good incident response platform. And unfortunately, the queue example is not fictional. Early in PagerDuty history, we built our own. Now, PagerDuty’s been around for 12, 13, 14 years now, and so it wasn’t as simple as “let’s just take Kafka off the shelf” back then, but that’s it. We definitely overemphasized the build-it-yourself thing in the early days, as did a lot of other companies. And with that came cognitive load, but we also did some really cool stuff. And honestly, a lot of that cool stuff was probably attractive to early engineers at PagerDuty, and was probably also attractive to early investors, because they could see what we could do and stuff like that. So, there were a lot of good things for the business, but it eventually got in the way. Now we use Kafka.

Mandi Walls: We’re going to pull the box off the shelf and we’re going to plug it into the thing and it’s all going to be fine. And then we don’t have to worry about [inaudible 00:28:04].

Rich Lafferty: Because it’s undifferentiated heavy lifting, having a … People aren’t buying PagerDuty. Customers aren’t signing up for PagerDuty because we’ve got a really good queue.

Mandi Walls: No. Right. It’s just like, they don’t even care. And as we discuss things like features in the product, like business services, with customers, the reports that are coming in from your users aren’t, like you say, that your queue is slow. That’s not what they’re seeing. So, you have to get that language around what matters to the customers and what’s actually important.

Rich Lafferty: That’s exactly it.

Mandi Walls: Awesome. Well, we’re just about at our time. We have a couple of questions that we like to ask folks. One of them, I think we probably covered. If you have a myth to debunk about rolling back to HA and DR …

Rich Lafferty: Oh yeah. I think the main one there is that HA and DR are the same thing. That’s really where people start to get confused. And it’s a choice you can make. It’s a very specific strategic choice you can make to use the same processes for both. But I think that’s going to get in your way.

Mandi Walls: Yeah, definitely. And then one other one that we sometimes throw at folks. What’s one thing you wish you had maybe known sooner about all of this, that maybe you learned the hard way or didn’t know the deep secrets of until something maybe bad happened?

Rich Lafferty: Oh boy. That one’s actually really easy. And that’s the whole thing that you can’t take the people out of the system. You have to design these things to … A lot of people go, well, if we could get rid of the human element, then things would just work. If only, right? Let’s just eliminate human error. Especially when it comes to the availability side of things, it’s true that a lot of incidents are caused by changes, are triggered by changes. But the response from a lot of companies is, “Oh, well, we need fewer changes. Why don’t we ship fewer changes and make them bigger?” No. That’s not going to work. Usually the idea is that you need to build a system that can defend itself from humans making mistakes. And it turns out that a lot of what you’re actually doing is building systems where the humans tend to be the thing that keeps it running. And so really, as you’re kind of building this all, it’s acknowledging that the operators of the system are a component of the system, and that you need to not just account for them but embrace the fact that complex systems are always failing, and the operators of the system are an important component in preventing that failure from going from being latent to being customer-facing. It took me a long time to get my head around that because obviously you want to build software that doesn’t need that. But when it comes down to it, if you’re building a complex distributed system, you can’t.

Mandi Walls: You can’t get away from it.

Rich Lafferty: You can’t get away from it. You absolutely can’t get away from it. It’s not an option. It’s the people version of wanting to have both consistency and availability. It’s just that’s sort of, there’s a menu of two things, not three things. And, it’s the same way with this. It’s just like you can’t design the people out of the system.

Mandi Walls: No. Maybe in the far future, the promise of AI maybe, but by then, we’re all like soft little blobs floating around on chairs or whatever, but it’s certainly not our reality right now. So, for one parting piece, do you have any piece of advice for engineers that are out there on this journey, things that they should be thinking about?

Rich Lafferty: That’s a good one. I mean, honestly, there’s a lot of thought in … So, I would say two things. The short thing is read the literature. The little thing is, get into the habit of reading. I’m not saying big computer science papers, but there’s a lot written out there about sort of the basics of how to do high availability, the actual implementation details, which we didn’t really talk about a whole lot. [inaudible 00:31:43] has a thing on their website called the Builders’ Library, which has a bunch of papers by their principal and distinguished engineers, solutions architects, and so forth, which have a lot of gems in them that are really accessible. And the other piece of that is there’s been a ton of writing, especially on the cognitive human factors, in other industries. And software has a tendency to think that we’re special, that no one else has ever encountered these problems before, but it turns out that a lot of industries have encountered these problems. They’re just not necessarily software running in the cloud. So, kind of getting the learnings and understanding that some of the stuff that we think is a hard problem is a solved problem in other industries, or at least that the thinking in other industries has moved far beyond where software has kind of got to on first principles.

Mandi Walls: Yeah, absolutely. It’s hard to think about software as being sort of immature. But as an industry, it absolutely is, and cloud computing being an absolute infant.

Rich Lafferty: Absolutely. And there’s huge advantages to that and there’s disadvantages as well.

Mandi Walls: Yeah. We don’t have to make everybody else’s mistakes, but only if we know about them.

Rich Lafferty: That’s exactly it.

Mandi Walls: Well, Rich, thank you so much for joining us today. This has been great. So, we’re signing off. We’ll see you again in two weeks, folks. Thank you. My name is Mandi Walls and I’m wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two. Thank you so much for joining us and remember, uneventful days are beautiful days.

Guests

Rich Lafferty (he/him)

Hosts

Mandi Walls (she/her)