Observability Engineering With Honeycomb

Posted on Tuesday, Sep 7, 2021
All-star trio Charity Majors, Liz Fong-Jones, and George Miranda join us this week to talk all things observability, observability engineering, and their new book on the subject.


Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting their systems. I’m your host, Mandi Walls. Find me @LNXCHK on Twitter. All right, welcome, welcome, welcome. Today I have a few folks with me for this episode. Today, I am also joined by…

Charity Majors: Charity Majors, @mipsytipsy on Twitter.

Liz Fong-Jones: @lizthegrey

George Miranda: And George Miranda, @gmiranda23 on Twitter.

Mandi Walls: Excellent. So our three guests are with us today from Honeycomb, and we’re going to be talking about observability. So we’ll hear about the state of observability in 2021. One of our earliest episodes, Christine Yen came on, talking about observability, and we thought it would be great to follow up. It’s been almost two years since that episode. So you also work at Honeycomb, that’s where Christine is. And you’ve been working on co-authoring a new book from O’Reilly. Yay. Titled Observability Engineering. And over the past couple of years, observability has really become a popular phrase and something that more people are at least talking about, whether they’re thinking about it or making plans to make inroads in it for their own projects.

So today we can dig into some of the hype and maybe uncover more of what it means to practice observability and how that helps you better understand all the production systems that you might have in your ecosystem. So we’ll put the earlier episode in the show notes for folks who haven’t listened to that one. So you can go back and check on it and make sure I’m not missing anything. So we covered a bit of the basics then, so we’ll go beyond that. So how about we start with, how is observability different from monitoring? What’s the most basic understanding for folks who are really just coming at this, but know what monitoring is, maybe?

Charity Majors: This is why I think that it is important for us to have a different term because there are a lot of best practices and there’s a lot that is actually the opposite of monitoring. Monitoring, you’ve got one process checking on another, going, how you doing, how you doing? It’s fundamentally about the state of the service. Is my service healthy? It’s often used for capacity planning. What percentage of utilization am I?

And it really is, it continues to be the gold standard for understanding infrastructure. From the perspective of the people who have to do rack servers or from the perspective of people who are… Do we have enough Postgres capacity for our clients that’s been sum up, that’s the gold standard. Observability is much more about the code that you write and ship every day. It’s about the code that is the lifeblood of your systems, as it were. And it’s about unknown unknowns, fundamentally, it’s about, can you understand… And this is inherited from the old 1960s definition from control theory, mechanical engineering. Can you understand what your code is doing? Any internal state the system can get itself into, without having to ship new custom code to handle that, because that would assume that you could have predicted it in advance.

Can you understand any novel state just by observing it from the outside, just by asking questions from the outside. And if you accept that definition, which I think increasingly a lot of people do, they accept that there’s a difference. They accept that observability is about, if monitoring is about the what, observability is about the how. Then it turns out there are a lot of technical things that proceed from that. You have to support high cardinality, you have to support high dimensionality. As far as we can tell, these are basic fundamental prerequisites to answering any question about your systems. But if anyone wants to prove us wrong and show that there are additional ways, we are all ears. Fundamentally it’s about the unknown unknowns.

Mandi Walls: Yeah, definitely. So we’ve got the difference between poking the things that we already know were there, and digging into the things that we don’t know that we didn’t know.

Charity Majors: Tell me about your entire existence from start to finish. Yeah.

Mandi Walls: It’s very existential. It’s a very philosophical question there. So there are a lot of misconceptions, maybe some lost in translation kinds of things about observability. What’s a myth or a common misconception about observability that you’d like to debunk?

George Miranda: The myth that I want to bust is that observability is about three pillars or even four pillars. I think that is the worst possible way to think about observability. So the three pillars definition, logs, traces, metrics, sometimes includes events. That definition is just focused on data types. And it’s particularly misguided because those three things, those four things are all basically the same data. An event is a record of a discrete amount of work that happened somewhere inside your system, which is essentially the same as a log entry. And so, assuming that your log data is structured and includes a duration, then those two things are exactly the same. And traces are just more of that data. Like add in a trace ID and a parent ID and a span ID, and you can connect all of those events together and voila, we just created a trace.

So a trace is just a series of interrelated events. And then metrics, metrics can be calculated by adding up all the logs that occurred over a specific period of time. You want to know how many concurrent threads were running from 8:06 AM to 8:07 AM. Great. Take all the events within that time span and duration and sum them up, and there’s your metric value. And so, as an infrastructure engineer, that was the eye-opening part for me, that like, oh, this is actually all the same data. And yet we treat it so differently. And that separation is what has kept us from seeing that it’s all just one system that we care about debugging and understanding. But when we treat that data really differently and we break it up into separate tools, into separate views, we lose this cohesive context about the thing that we actually care about.

And so, to Charity’s point, when something is wrong inside of your application system, and you don’t know what it is, you need to slice and dice that data and analyze it in whatever way you need to uncover hidden sources of anomalies. And that’s the breakthrough. Realizing that after decades of using separate tools to look at discrete parts of our stack, that it doesn’t have to be that way anymore. That we have a better holistic approach available today. And so that three pillars definition, it completely undermines that insight. It reinforces those traditional views.
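
To make George's point concrete, here is a minimal Python sketch of how a trace and a metric can both be derived from the same list of structured events. The field names (`trace_id`, `span_id`, `parent_id`, `start_s`, `duration_ms`) and the sample data are assumptions made for illustration; they are not taken from Honeycomb or any particular tool.

```python
from collections import defaultdict

# Hypothetical structured events: one record per discrete unit of work.
EVENTS = [
    {"name": "GET /cart", "trace_id": "t1", "span_id": "a", "parent_id": None,
     "start_s": 0.00, "duration_ms": 120, "status": 200},
    {"name": "auth.check", "trace_id": "t1", "span_id": "b", "parent_id": "a",
     "start_s": 0.01, "duration_ms": 15, "status": 200},
    {"name": "db.query", "trace_id": "t1", "span_id": "c", "parent_id": "a",
     "start_s": 0.03, "duration_ms": 80, "status": 200},
    {"name": "GET /cart", "trace_id": "t2", "span_id": "d", "parent_id": None,
     "start_s": 30.0, "duration_ms": 950, "status": 500},
    {"name": "GET /cart", "trace_id": "t3", "span_id": "e", "parent_id": None,
     "start_s": 75.0, "duration_ms": 110, "status": 200},
]


def build_trace(events, trace_id):
    """A trace is just the events sharing a trace ID, linked parent -> children."""
    children = defaultdict(list)
    for event in events:
        if event["trace_id"] == trace_id:
            children[event["parent_id"]].append(event["span_id"])
    return dict(children)


def count_metric(events, start_s, end_s, **filters):
    """A counter metric is an aggregate over events in [start_s, end_s)."""
    return sum(
        1
        for event in events
        if start_s <= event["start_s"] < end_s
        and all(event.get(key) == value for key, value in filters.items())
    )


trace = build_trace(EVENTS, "t1")
requests_first_minute = count_metric(EVENTS, 0, 60, name="GET /cart")
errors_first_minute = count_metric(EVENTS, 0, 60, name="GET /cart", status=500)
```

The same records answer a tracing question (`build_trace`) and a metrics question (`count_metric`), which is the sense in which the data types are one data set viewed differently.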

Liz Fong-Jones: It kind of confuses the how with the why. The why is, we want engineers to be able to more efficiently run these large and sprawling complex systems and that the older techniques of trying to enumerate every possible failure case, they just don’t work anymore. That these techniques of, oh, more data will just solve our problem. Instead of adding more data, we need to think about being able to better look at it and examine it and query the data you already have.

Charity Majors: Another interesting way to look at this, from the perspective of an infrastructure engineer, is that it became necessary for us to develop a new way of doing things when microservices came about. Because before that you had the app, the web, the database. And you could pretty much predict 80% of the ways the system’s ever going to fail and make dashboards from there and go home. Well, and all of the complexity was inside the app. So if all else fails, you could just attach GDB and step through it. Well, you can’t do that anymore because you’ve blown up the monolith and now your request goes, hop, hop, hop, hop, hop, hop, hop, hop. And when we started doing this, five, six years ago, there was no way of carrying that context. So part of what observability is, is instrumenting your code in such a way that you gather up that context and you ship it along with the request as it goes all the way through your services.
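
The idea of shipping context along with the request can be sketched in a few lines of standard-library Python. This is a simplified illustration, not how any specific library does it: the header names `x-trace-id` and `x-span-id` and the two service functions are hypothetical, and two services are simulated as plain function calls rather than network hops.

```python
import contextvars
import uuid

# Context for the request currently being handled (illustrative only).
_request_ctx = contextvars.ContextVar("request_ctx", default=None)


def start_request(incoming_headers):
    """Adopt the caller's trace ID if present, otherwise start a new trace."""
    ctx = {
        "trace_id": incoming_headers.get("x-trace-id") or uuid.uuid4().hex,
        "parent_id": incoming_headers.get("x-span-id"),
        "span_id": uuid.uuid4().hex[:16],
    }
    _request_ctx.set(ctx)
    return ctx


def outgoing_headers():
    """Headers to attach to every downstream call made during this request."""
    ctx = _request_ctx.get()
    return {"x-trace-id": ctx["trace_id"], "x-span-id": ctx["span_id"]}


def inventory_service(headers):
    """Hypothetical downstream service: records its own span."""
    return [start_request(headers)]


def cart_service(headers):
    """Hypothetical front service: records a span, then calls downstream."""
    ctx = start_request(headers)
    # copy_context().run stands in for crossing a process boundary:
    # only the headers travel, not the caller's in-memory state.
    downstream = contextvars.copy_context().run(
        inventory_service, outgoing_headers()
    )
    return [ctx] + downstream


spans = cart_service({})
```

Every hop reuses the trace ID it was handed and records the caller's span ID as its parent, so the events can be stitched back together later. Real systems typically rely on a standard header format such as W3C Trace Context rather than ad hoc names like these.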

Mandi Walls: Yeah. Super interesting. It coincides, as the co-dependent evolution of the way the code has changed and the way that tools and debugging and all of those other things have to change downstream of that. So the change in debugging from intuition, and like you say, attaching your debugger to the monolith versus going through more traditional dashboards and all those kinds of things. What kind of evolution do you find for engineers who are starting to think about, you have this distributed system, you have these observability tools now, what’s your next step with those things? How do you get into that next evolution?

Liz Fong-Jones: It’s kind of this interesting question of, is this a greenfield or brownfield situation? I think that those two paths are very different. For the greenfield case, I think it’s really important to start implementing the right primitives, to start off with a system like OpenTelemetry that handles all of your trace propagation. So you can just add key value pairs anywhere in your system, and they’ll get automatically added to your trace spans.

But maybe you’re in a world where you can’t start with your instrumentation from scratch. Maybe you’re retrofitting an existing system. And I think for that, our answer is, you probably already have log lines, you probably already have application metrics. How can you bring those things together? How can you add more structure? So it’s not something that you’re going through with a full-text search tool. But instead, you’re able to pick out all the fields and you’re emitting one log line per request and reporting the request ID. Those are some of the steps that make your data more queryable and more efficient. So thinking about what is this data, what can I keep, what should I not be keeping?
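
As a rough illustration of the retrofit Liz describes, the sketch below emits exactly one structured log line per request, with a request ID and named fields instead of free text. The handler, its fields (`cache_hit`, `user_id`), and the `emit` parameter are hypothetical; only Python's standard library is used.

```python
import json
import time
import uuid


def handle_request(path, user_id, emit=print):
    """Handle one request and emit exactly one structured event for it."""
    event = {"request_id": uuid.uuid4().hex, "path": path, "user_id": user_id}
    start = time.monotonic()
    try:
        # Real work would happen here; fields are added as they become known.
        event["cache_hit"] = False
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        # One line per request, emitted whether the request succeeded or not.
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emit(json.dumps(event, sort_keys=True))
    return event


lines = []
handle_request("/cart", "user-42", emit=lines.append)
parsed = json.loads(lines[0])
```

Because each line is a JSON object, a downstream tool can filter or group by any field directly rather than running a text search over the message.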

Mandi Walls: Okay. Interesting. So for a lot of folks that might be, you’re actually changing the way they’re doing their logs and putting those things together, maybe adopting some new practices and style guides and things like that they had -

Charity Majors: Ideally. Right. Ideally everyone would have a UUID that is persisted all the way through their stack, et cetera. But you could also come a long way just using some of the post hoc processors. What’s the name of that one started by the Splunk folks?

Liz Fong-Jones: Cribl?

Charity Majors: Cribl. Yeah. Cribl is a tool that you can just attach to the stream at the outside and it will reconstitute your logs into events for you. There’s plenty of proscripts out there doing this, right as we speak.

Mandi Walls: Yeah.

Liz Fong-Jones: I think the important thing is that on the incentivization side, if your tooling doesn’t show people value from that experience of making wide events, then people are not going to do it. It’s kind of, you have to offer people a carrot. And that carrot is, you can understand your system better. You can understand your systems faster by being able to examine along these dimensions that you weren’t previously able to query for.

Charity Majors: The hardest problem in distributed systems is rarely what is the code, it’s usually, where is the code that we need to debug. And that’s what something like Honeycomb, or any real observability tool, does so well: isolating where that is coming from, and what the errors have in common. Say you see a spike of errors on your dashboard with metrics, then what do you do? Well, you usually go flipping through pages, looking for more dashboards that look exactly like that, looking for the same curve. That’s not exactly science. That’s not debugging, that’s visual scanning. It’s pattern matching. Sometimes it works, but usually it doesn’t, or it tells you only part of the story. Whereas, with Honeycomb, you can click on that and you’ll immediately get like, if you’re using BubbleUp, we precompute for all dimensions inside and outside the bubble that you selected. And then we sort them to say, oh, that thing you care about has these five things in common.

Maybe this iOS version, this endpoint, these free traces, whatever it is. It’ll tell you what exactly is in that error spike, or as Liz will tell us, you can start with the top level of, here’s my SLO, here’s my SLO violation. And then you click on it and you see exactly what the events that are violating the SLO are doing, which is kind of a game changer. We’re so used to sitting here, looking at our dashboards and then flipping over to try and visually correlate to our logs. And, oh, these seem within five minutes of each other, it’s probably the same thing. Thumbs up. And then copy pasting an ID over into our tracing tool. And it’s just not tied together by anything except the operator’s guesses.
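
The inside-versus-outside comparison Charity describes can be approximated with a small amount of Python. This is only a sketch of the general idea under assumed data, not Honeycomb's implementation: the event fields, the sample data, and the scoring rule (difference in the share of events carrying each value) are all illustrative choices.

```python
from collections import Counter


def compare_selection(events, is_selected, dimensions):
    """Rank each dimension's values by how over-represented they are in
    the selected events compared with everything else."""
    selected = [e for e in events if is_selected(e)]
    baseline = [e for e in events if not is_selected(e)]
    report = {}
    for dim in dimensions:
        sel = Counter(e.get(dim) for e in selected)
        base = Counter(e.get(dim) for e in baseline)
        diffs = []
        for value in set(sel) | set(base):
            sel_frac = sel[value] / len(selected) if selected else 0.0
            base_frac = base[value] / len(baseline) if baseline else 0.0
            diffs.append((value, round(sel_frac - base_frac, 3)))
        report[dim] = sorted(diffs, key=lambda d: d[1], reverse=True)
    return report


# Hypothetical events: every error comes from one iOS version.
EVENTS = (
    [{"status": 200, "ios_version": "14.8", "endpoint": "/home"}] * 3
    + [{"status": 200, "ios_version": "15.0", "endpoint": "/cart"}] * 2
    + [{"status": 200, "ios_version": "15.1", "endpoint": "/cart"}] * 1
    + [{"status": 500, "ios_version": "15.1", "endpoint": "/cart"}] * 2
    + [{"status": 500, "ios_version": "15.1", "endpoint": "/home"}] * 1
)

report = compare_selection(
    EVENTS, lambda e: e["status"] >= 500, ["ios_version", "endpoint"]
)
top_ios_value, top_ios_score = report["ios_version"][0]
```

In this made-up data the iOS version separates the errors from the baseline far more sharply than the endpoint does, which is the kind of signal that flipping between dashboards tends to miss.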

Liz Fong-Jones: And I think the interesting thing there is that a lot of the automation solutions that have reached the market are focusing on the wrong part of the problem. Instead of focusing on empowering people to understand their data better, they’re taking autonomy and control away from people. Whereas a human being might have context as to which metrics might go together, a machine doesn’t necessarily have that context. It can’t tell what’s a cause and what’s an effect. So when you unleash a machine learning system to tell you, “Hey, tell me when there’s any anomalies”, that machine learning system will wake you up at all hours of the night. And there won’t necessarily be a user impact related. And it’s just your engineers are being woken up by the system that has turned full Skynet. And I think that’s not okay. I think that instead we should be focusing on empowering users and surfacing the most relevant data to them when they ask for it.

George Miranda: And the thing that’s different with observability is that, that empowerment comes from enabling you to debug from first principles. So to Charity’s point, rather than having this familiarity of dashboards or the patterns that you’re looking for, you can objectively look to see what is this application really doing? And the aha moment that I see with folks and when observability really clicks, is that moment when they look and they start seeing discrepancies between what they think their code is supposed to do and how it operates, based on a spec or an architectural diagram, and what it’s really doing in the real world. And we call that the mean time to, what the [bleep]? And lowering that right in, and just seeing like, oh, this is reality. This is what’s really happening. That’s the power. And that’s the game changer.

Mandi Walls: For folks just getting started on these kinds of things. So you’ve mentioned a couple of other practices that hang along with this. Is this something that requires a bit of sophistication or a bit of engineering and maturity to embrace? Or are you seeing folks that, maybe their last generation game wasn’t up to the A-class, but are getting into observability and reaping big rewards there?

Liz Fong-Jones: I love that specific example you used because we actually did an interview on o11ycast, the podcast that a number of Honeycomb employees co-host, with Nick Herring from CCP Games. And CCP Games runs EVE Online, which is a 20 year old game. And they have been modernizing with Honeycomb, and that’s really exciting to hear about and to see. But I think the important thing there is, they release client updates to their game at least once per week, sometimes multiple times per week. And this is with a massive install base across tens of thousands or hundreds of thousands of real people’s computers. And I think that if they can do it, that is a sign that your organization can ship every week, at minimum.

And I think that the place where people do struggle is, let’s suppose that you are excited about observability, but your organization changes software in production once every six months, even if you don’t have to write a single line of instrumentation, even if there’s a magical button you can turn on to get the insights. If you’re not able to change your code to fix the problems you’re seeing, and then go back to find another round of problems and fix. If that feedback loop is not tight, you’re not going to see benefits. You’re much better served accelerating your delivery process and shifting left and lowering that cycle time before you really start to embrace observability.

So I think that’s what we perceive as the table stakes is, you have to be releasing to production every three months, every month. Charity’s going to argue and say, you know what? I want to bust. But I think that observability only starts becoming practical and observability only starts giving you insights you can act upon immediately, when you’re prepared to act on those insights.

Charity Majors: No, that’s an excellent point. Absolutely. Fantastic.

Mandi Walls: So all of this stuff that you’ve been talking about, all the things you’ve been distilling into all the knowledge that you have, what are folks going to find of all of this in your new book? So it’s in previews on Safari. I added it to my shelf, but haven’t started it yet, but I’m looking forward to that. But what are we going to learn in the new book about all this stuff?

George Miranda: I think the way the book is coming together, it is step-by-step unpacking, why observability? Why is this necessary? And what are the fundamental building blocks and how do you start incrementing your processes that you’ve done in the past? And what things do you start doing entirely new? And basically methodically making a case. And I think whenever you introduce a fundamentally new practice and something that is so differentiated from something like monitoring, and we’ve been doing it a particular way for decades. The burden of proof is on you to show that this is indeed a better way. And to methodically make that argument in more than a blog post or a Twitter thread or even this podcast. And so, I think the length of the book is actually sufficient to do that, kind of like a start to finish sort of way.

Mandi Walls: Will it serve as a workbook for folks that are looking to get started? Is it more theory?

George Miranda: It’s a mix of both. I think there’s a lot of introductory, why, and then there’s really a lot of meat in the middle of how, and specific code examples and showing differences in, for example, SLO data types. And why you would use observability data.

Liz Fong-Jones: I think it’s not just technical though. We’re showing you not just how to do the technical parts, we’re showing you how to do the cultural parts. I think the cultural parts are the most important bits, because while we do have a deep dive on, for instance, how you would build a data store, the argument is you shouldn’t have to. But if you really want to understand, how does this work? How is this not snake oil? We unpack it all. Our goal is to help you understand how we came to the conclusions that we did, as to why this is the best way to solve this problem.

George Miranda: And to Liz’s point, also to encapsulate, what are the capabilities that enable observability and what does that mean? And how is it more than just, it’s not a synonym for monitoring, it’s not a synonym for telemetry. It is, that ability to understand your system in new ways. But what does that mean? It means there are a lot of different facets to that. And so that’s what we try to cover.

Mandi Walls: Excellent. Sounds great. Yeah. So we’ll put a link in the show notes for folks who are interested in getting an early look at that. And so, looking into the future, none of us know what’s coming next, whatever the next exciting thing might be, there’s certainly a bit of groundswell around things like serverless. It seems like they would play into this very well. Other technologies, other movements. What do you feel is next for observability? What are the next pieces of new horizon or new ground that could be cut for observability out there?

Liz Fong-Jones: I think the main thing that we need to see first is we need to see this convergence around OpenTelemetry and open standards, that people have been using so many different disparate systems, different telemetry protocols. And once we all converge on one way of propagating context, across potentially many different signals, including metrics, including traces. I think that, that’s going to help a lot. So the good news is, OpenTelemetry tracing went GA recently, and OpenTelemetry metrics is going to follow really soon. But I think that we need to see that 80% to 100% adoption, rather than the 20% to 60% adoption that you would see with our current CNCF incubated status.

Charity Majors: I do think that it plays a role. And I also don’t think… Observability is not synonymous with OpenTelemetry. You can use OpenTelemetry to accomplish observability. You could also use OpenTelemetry to accomplish monitoring stuff. It’s pretty agnostic there, which is part of its strength. Where OpenTelemetry helps is in reducing the barrier for users to try out different services and back ends like that. So it’s helpful there, but it’s not really necessarily part of observability. I just don’t want people who start using OTel to go, cool, now I have observability because it’s pretty distinct.

Mandi Walls: Yeah.

Charity Majors: I think that there’s next gen stuff. A lot of it has to do with lowering the barrier to entry for users, making it so that it’s their native language that they pick up. So they don’t have to unlearn all of the old ops stuff and then learn the new stuff. It’s so much more intuitive and easy for new grads to just start out with something like Honeycomb than it is for old people like us who have been like, okay, I have to unlearn 20 years of Ganglia here before I could start to comprehend this. There’s a whole wave out there of production-first toolsets, these startups that have started in the last five years. It was really exciting to me that the center of gravity is really shifting to tooling for production, understanding production instead of, like us and Gremlin and LaunchDarkly, although these companies are [inaudible 00:21:34].

Liz Fong-Jones: The other one that I wanted to mention there is Sentry. I’m really excited that Sentry is introducing front-end devs for the first time to this idea of distributed tracing. Like, oh my God. Yes.

Mandi Walls: Yeah. So much of that stuff has matured. And like you say that, I love the idea of the production-first thought model. Thinking about software as something that you’re going to put in front of a user and not something that you’re just deploying to a package in an artifact repository and just sits there.

Charity Majors: Treating infrastructure engineers like real people is a novel thing for… Yes.

George Miranda: And to Charity’s point about new college grads and new ways of understanding your system. I think when you look at software development, there is a tried and true pattern for understanding discrepancies between what your code is supposed to do and what it does versus a test. But then as we approach production, it just becomes the wild, wild west. And like, who knows what’s going to happen in that chaos? And it doesn’t have to be that way. And I think adopting an observability first mindset is about realizing that, oh, there are actually solid ways to track that and see that and analyze it, and figure out why is this different when billions of people are using it across thousands of different devices in different geos, in different ways, versus a sterile set of lab tests. And that transition is, I think that is what we want to see proliferate. And that’s where observability is going.

Charity Majors: Exactly. TDD is the most successful software paradigm of my lifetime. And I feel like, yes, observability driven development will incorporate lots of tests, but also with the understanding that, you don’t know [bleep] until it’s running in production. As Liz was saying earlier, shrinking that time, that interval between when you write the code and when the code is live, so that it’s predictable and swift so that you could be developing a constant conversation with your code, is actually a really important part of observability, because if you’re instrumenting your code you’re like, great, future me is going to want to know these things, but then it ships in a month. You’re not going to remember. You don’t know, no idea what’s going on. Like in part the genius and magic of writing software with a very tight feedback loop is that, you could find most bugs before they make their way to users because you’re instrumenting, you’re looking at it, you’re instrumenting, you’re looking at it. And you’re asking yourself, is it doing what I expect it to? And does anything look weird?

But I’m really stoked that this next generation of engineers, like the debate over whether or not to put software engineers on call or not, seems to be concluded, decided, yes we do. If you develop a 24/7, highly available service, you should own your software in production because there’s really no other way. It’s not that we want everyone to be just as miserable as ops has been.

Mandi Walls: But we do.

Charity Majors: Well, shhh, Mandi. Inside voice. But rather, this is the only way that we actually make things better so that nobody has to be that miserable. Right now, people are shipping code that they don’t understand out of these hairballs of stuff that nobody’s ever understood. And they’re crossing their fingers and backing away slowly. That’s no way to live your life.

Liz Fong-Jones: And they’re making it the QA department’s concern. They’re just like, if this breaks, it’ll get kicked back and [crosstalk 00:24:51].

Charity Majors: Kick that ball over the sun. Lower class developer group or something. Yeah. It’s not cool.

Liz Fong-Jones: I think someone was describing to me the other day, like how the Oracle SQL database is developed. And it’s just like, it’s this giant monolith and people write all these acceptance tests and they blow up anyways and they get kicked back and they have to look at it in a month when QA kicks it back in. And it’s like, that is no way to live.

George Miranda: It’s absolutely not.

Liz Fong-Jones: And people think that, oh, it’s so great that I don’t have to be on call. But the reality is they’re paying that cost anyways.

Charity Majors: There was a study that Facebook did that showed that from the moment you write a bug, you write it, you type it out, you backspace, you fix it. Cool. That’s the fastest you could ever fix a bug. The longer the interval between when you write the bug and when you discover it, the more the cost of finding and fixing it goes up, exponentially.

George Miranda: And as someone that has been on call for a majority of their career, until I made the switch over to vendor life, the game changer is realizing that duct tape and popsicle stick approach to production, that we were just describing. We are conditioned to live with that and think that that level of not understanding is normal. And that, that goes with the turf. And this is just the way things are. And the moment you realize it doesn’t have to be that way, and you can actually see what’s really happening and understand production. Yeah, exactly. That’s the moment when you’re like, I could never go back. [crosstalk 00:26:16]

Liz Fong-Jones: I am never working another job again, without having observability.

George Miranda: Totally.

Charity Majors: Which points, as we should keep pointing out, which points to how compelling it is when it comes to recruiting and retaining good engineers. Once they’ve gotten a taste of the good life, they’re going to ask these questions of all future employers.

Mandi Walls: Absolutely. So for new folks that are out there, they’re just coming on board with all this stuff. I hate to try and dash their hopes with horror stories. But at the same time, they’re cautionary tales. I still find places where I want to say to the people, get a different job. Why are you sitting here with this? So yeah, now the golden future looks so much better than what the mid 2000s did.

Charity Majors: And there’s this really unfortunate mindset that I think is really prevalent where people are like, that would be great, but that’s not for me. I don’t get nice things. That’s just for the Googles and the Facebooks, it’s for good engineers, which is such bull[bleep].

Liz Fong-Jones: What’s that thing that you say Charity, the thing about how engineers rise or fall to the level of their team?

Charity Majors: Yeah. Basically, within a couple of months of joining your team, your level of productivity will rise or fall to join that of the team. You don’t get to be a special snowflake who just leaps over the entire deploy process. God, I hope you don’t. This is the easy way to build and develop software. The hard way is the way people are doing it now, if you could develop software the way you’re doing it now, you are more than capable of developing with observability.

Mandi Walls: That sounds fantastic. We are just about out of time. Is there any one little nugget of wonderful knowledge that you’d like to close with or anything you’d like to remind people or ask them to check out after they’re done listening -

Charity Majors: We have a free tier that never expires.

Mandi Walls: Oh, that sounds fantastic.

Charity Majors: It’s for people to run production workloads. It’s pretty sizable, and none of this one month bull[bleep].

George Miranda: And here’s what I’ll say. The nugget that I would go with, I guess, building on that is, that moment that I was talking about when you see the discrepancies between what’s really happening and what you thought was happening, that is when people start to get it right. So if you can use a free solution and start sending in your own application data, that’s when it starts to click. We can describe it all we want. We can tell you all about it, but until you see that and realize, oh, I can really understand for the first time, it’s not going to click until then.

Liz Fong-Jones: And even if you’re not willing to send us your production app data, send us your build data, on why your build is so slow. We guarantee you, it’ll fit in our free tier. And you will find ways to shave 10 minutes off of every single build.

Charity Majors: You will find so many bugs that have existed for so long and you just never knew.

Liz Fong-Jones: But yeah, the thing I wanted to close with is, service level objectives. If you are paging yourself, because the CPU goes above 90%, you don’t have to live like this. Only page if your customers are actually in pain.
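
One way to express "only page if customers are in pain" is to alert on how fast a service level objective's error budget is being spent, rather than on resource utilization. The Python sketch below is a simplified illustration: the 99.9% target and the paging threshold of 10 are arbitrary example values, not recommendations from the speakers.

```python
def burn_rate(total_requests, failed_requests, slo_target):
    """How many times faster than 'allowed' the error budget is being spent.

    A burn rate of 1.0 means failures arrive at exactly the pace the SLO
    tolerates; higher values mean the budget will run out early.
    """
    if total_requests == 0:
        return 0.0
    error_rate = failed_requests / total_requests
    return error_rate / (1.0 - slo_target)


def should_page(total_requests, failed_requests, slo_target=0.999, threshold=10.0):
    """Page a human only when user-visible failures burn budget quickly.

    Note that CPU utilization is deliberately not an input here.
    """
    return burn_rate(total_requests, failed_requests, slo_target) >= threshold


quiet_hour = should_page(total_requests=10_000, failed_requests=5)
bad_hour = should_page(total_requests=10_000, failed_requests=200)
```

With these example numbers, 5 failures in 10,000 requests stays well under the threshold, while 200 failures burns the budget about 20 times too fast and would page.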

Mandi Walls: Absolutely. Customers first, customer experience first. Fantastic. Well, we’ll put all that stuff in the show notes for folks listening, check that out. We’ll link to Honeycomb sites and the preview of the book and all the other exciting things that are going on. So thank you so much, all three of you for joining me today, this has been fantastic. Super great to learn all this stuff. And so, we’re signing off. This is Mandi Walls and we are wishing you an uneventful day.

That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you liked what you’ve heard. You can find our show notes at www.pageittothelimit.com and you can reach us on Twitter @PageIt2TheLimit, using the number two. Thank you so much for joining us and remember, uneventful days are beautiful days.


George Miranda

George Miranda directs Product Marketing at Honeycomb, where he helps people improve the ways they run software in production. He made a 20+ year career as a Web Operations engineer at a variety of small dotcoms and large enterprises by obsessively focusing on continuous improvement for people and systems. He now works with software vendors that create meaningful tools to solve prevalent IT industry problems. He’s a former Page It to the Limit host.

George tackled distributed systems problems in the Finance and Entertainment industries before working with Buoyant, Chef Software, and PagerDuty. He’s a trained EMT and First Responder who geeks out on emergency response practices. He owns a home in the American Pacific Northwest, has no idea where home is anymore, and loves writing speaker biographies that no one reads.

Liz Fong-Jones

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

Charity Majors

Charity is professionally caremad about computers. She is an operations and database engineer and sometimes engineering manager. Right now Charity is the CTO and cofounder of Honeycomb, builders of observability for distributed systems. (“Monitoring” doesn’t have to be a dirty word; give it a try.)

Until recently Charity was a production engineering manager at Facebook, where she spent 3.5 years working on Parse (both pre and post-acquisition by FB). She also spent several years at Linden Lab, working on the infrastructure and databases that power Second Life, and is the co-author of “Database Reliability Engineering” by O’Reilly.

Charity was a classical piano performance major in college, but dropped out because it turns out she prefers not being dirt poor. She has been building systems and engineering teams ever since.

Charity loves startups, chaos and hard scaling problems, and somehow always ends up in charge of the databases.


Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.