Chaos Engineering With Bruce Wong

Posted on Monday, Dec 9, 2019
In this episode, host George Miranda chats with Bruce Wong, Director of Engineering at Stitch Fix, about practical ways of getting started with Chaos Engineering.

Show Notes

Creation of the term “Chaos Engineering”

Bruce tells us about how the term “chaos engineering” came to be and the mindset behind using the term.

“Let’s create a team strategy and vision around [Chaos Monkey and the practices around it] and let’s double down on what we already started. So in that fashion, we wrote a blog post that introduced the term ‘Chaos Engineering’ and introduced the term ‘Chaos Engineer.’”

What does Chaos Engineering really mean?

Bruce breaks down the pragmatic reasons this practice exists and why we should think about adopting it.

“It’s being proactive and getting a chance to validate our resilience design: finding out how well our systems are architected at 3pm instead of 3am.”

Chaos Engineering Thought Exercises

We discuss how tabletop thought exercises serve as a valuable tool to help you flesh out considerations long before touching any production systems.

“We call it ‘zero tech’ tabletops. I don’t want laptops. I don’t want distractions and excuses for why we can’t get started. And so I run these tabletop exercises, with a whiteboard, with a drawing of the architecture and we talk about our detection strategy, resilience, trade offs, and the parts that fail.”

But I’m not ready for Chaos Engineering!

A common response to the suggestion that a team adopt Chaos Engineering is that they’re simply not ready to get started. We discuss some ways to address these concerns.

“If we’re not ready for this, then are we really ready for production? Ready or not, failure is going to happen.”

Identifying big impact components to test

How do you prioritize which components of your stack to test? What are the considerations for figuring out where to start? Bruce gives some practical advice on where in your stack to start and how to find opportune moments to seize.

“Cloud provider outages… are the best opportunities. They allow us to identify and be introspective about the things in our control that we can do about this.”

When should you start?

No, really. Big outage aside, when should we get started? Here’s where we see George’s managerial background kick in. Can we start today? Bruce provides some great practical wisdom around getting started as early as when new team members are being onboarded.

“When’s the time you want to start writing more resilient software?”

When the real outage happens

It’s important to celebrate wins. In Chaos Engineering, a win is when your team gets to relax while a failure happens.

“You’re celebrating because this thing failed exactly as we planned! It happens and there’s nothing for us to do. We’re just sitting back and watching the show.”

Capturing what we learn from Chaos Engineering

Building more resilient systems means taking the things we learn from Chaos Engineering exercises and ensuring that resulting action items make it into our work streams. How can teams do that successfully?

“The first time I did this, we did sprint planning and then the chaos engineering exercise. Nope. That’s the wrong order!”

Parting Advice

Bruce wraps up with practical tips for moving your teams in the right direction.

“You don’t need fancy tooling. You need 3 lines of code: if my user, fail this call.”
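
To make that idea concrete, here is a minimal sketch in Python of what "if my user, fail this call" could look like. It is not code from the episode: the names `CHAOS_TEST_USERS`, `fetch_recommendations`, and `recommendations_with_fallback` are hypothetical, and the downstream call is a stand-in. The injected failure only ever affects an allowlisted test account, so regular users are untouched.

```python
class InjectedFailure(ConnectionError):
    """Raised when a call is deliberately failed for a chaos test user."""


# Hypothetical allowlist: only these user IDs ever see injected failures.
CHAOS_TEST_USERS = {"chaos-test-user"}


def fetch_recommendations(user_id):
    """Stand-in for a real downstream call (hypothetical)."""
    # The "3 lines": if it's my test user, fail this call.
    if user_id in CHAOS_TEST_USERS:
        raise InjectedFailure("injected failure for " + user_id)
    return ["item-1", "item-2"]


def recommendations_with_fallback(user_id):
    """Caller-side resilience: degrade to an empty list if the call fails."""
    try:
        return fetch_recommendations(user_id)
    except ConnectionError:
        return []
```

Calling `recommendations_with_fallback("chaos-test-user")` exercises the fallback path, while any other user ID still gets the normal response. That gives you a way to check the resilience design without building or buying tooling first.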

Guests

Bruce Wong

Bruce Wong is Director of Engineering at Stitch Fix. He previously worked at Netflix and Twilio, where he founded the Chaos Engineering effort to stress and proactively introduce failure into critical production systems to validate resilience. He is passionate about tackling challenging problems, scaling engineering teams, and building compelling products. In his spare time he can be found applying engineering principles to iterate on BBQ and chocolate chip cookies.

Hosts

George Miranda

George Miranda is a Community Advocate at PagerDuty, where he helps people improve the ways they run software in production. He made a 20+ year career as a Web Operations engineer at a variety of small dotcoms and large enterprises by obsessively focusing on continuous improvement for people and systems. He now works with software vendors that create meaningful tools to solve prevalent IT industry problems.

George tackled distributed systems problems in the Finance and Entertainment industries before working with Buoyant, Chef Software, and PagerDuty. He’s a trained EMT and First Responder who geeks out on emergency response practices. He owns a home in the American Pacific Northwest, roams the world as a Nomad with his wife and dog, and loves writing speaker biographies that no one reads.