Global Health, Incident Response, and Chaos Engineering With Jason Yee

Posted on Monday, Mar 9, 2020
In this episode, Julie Gunderson talks with Jason Yee, Director of Advocacy at Gremlin, about the current state of affairs with COVID-19 and the similarities between incident response and chaos engineering.

Show Notes

The State of Current Affairs

Julie and Jason talk about how COVID-19 is affecting the conference industry and how we are adapting our work.

Jason Yee: “There are other implications of how do you operate as a company when you are impacted by outside forces such as viruses and outbreaks, and what does [that] mean for things like disaster recovery and resiliency and not just for your systems but for your people systems.”

The conversation turns to how things are moving to online options and remote work.

Failure of Imagination

Jason and Julie talk about how you imagine what failure looks like.

Jason Yee: “Failure of imagination means that we often fail to think about ways that things can break, and in hindsight they look fairly obvious.”

Jason goes on to talk about how to think about failure and how to imagine what failure states look like. Jason and Julie talk about how PagerDuty tests for failure through Failure Fridays, and how those lessons can be translated to less technical settings.

Chaos Engineering with People

Jason talks about the practice of chaos engineering with people, and ensuring that knowledge is distributed.

Jason Yee: “If we actually spend time and imagine what our processes would look like by sort of messing with the people [vacation and schedules] in it we could probably come up with some more interesting ones as well.”

Julie and Jason talk about practicing for failures and disasters and how practicing leads to comfort and the reduction of chaos in actual emergencies and incidents.

Early Signals

Jason and Julie talk about what we can learn from our systems and from the past, and how to apply those lessons moving forward. Jason talks about the three categories, Work Metrics, Resource Metrics, and Events, and how early indicators feed into larger objectives.

Jason Yee: “What are those early things that I can monitor and take a look at that contribute to the overall objectives, and if I can monitor those indicators and get advanced warning on those to see if something is potentially wrong, then I could potentially head off issues before I violate my objective or agreements.”
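As a rough illustration of the early-indicator idea in the quote above, the sketch below flags an indicator as a warning before it reaches the level that would violate an objective. The `Indicator` class, the metric names, the sample values, and the 80% warning ratio are all hypothetical choices for this example and were not discussed in the episode.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    """A monitored value and the objective it contributes to (hypothetical model)."""
    name: str
    value: float              # latest observed value
    objective: float          # level at which the objective is violated
    warn_ratio: float = 0.8   # assumption: warn at 80% of the objective

    def status(self) -> str:
        # "violated" means the objective itself has been breached.
        if self.value >= self.objective:
            return "violated"
        # "warning" is the advance notice: close to the objective, not yet over it.
        if self.value >= self.objective * self.warn_ratio:
            return "warning"
        return "ok"


# Made-up sample readings for illustration only.
indicators = [
    Indicator("p99_latency_ms", value=450.0, objective=500.0),
    Indicator("error_rate_pct", value=0.2, objective=1.0),
]

for indicator in indicators:
    print(indicator.name, indicator.status())
```

With these sample values, the latency indicator reports a warning while the error rate is still fine, which is the window in which a team could act before an objective or agreement is violated.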

The Right Way to Wash Your Hands

Jason and Julie talk about how proper handwashing takes 20 seconds and the songs you can sing while doing it, but more so about practicing doing things the right way, so that when you are in an emergency situation you don’t have to retrain bad behaviours.

Jason Yee: “Practicing correctly should be the same process as what you do in real life, it shouldn’t just be a response to like ‘oh now we’re going to do a different process because it is a real virus’ or ‘now we’re going to do a different process because our critical systems are really down’ versus what we are doing when we practice chaos engineering.”

The Right Methodologies

Jason talks about the methodologies behind chaos engineering.

Jason Yee: “In terms of the systems we build the methodologies really come down to, when it comes to chaos engineering; make that practice rigorous, come up with a good hypothesis, be rigorous about how you test that.”

Jason continues to talk about how you need to test in a scientific and repeatable way, and how you need to do things the same way each time to get the same effect when you are testing.
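One way to picture the hypothesis-driven, repeatable approach described above is the minimal sketch below. It states a hypothesis up front, runs a fixed number of trials with a fixed random seed so a rerun sees the same conditions, and records whether the hypothesis held. Everything here is an assumption for illustration: `run_experiment`, `simulated_latency_ms`, the threshold, and the simulated measurements stand in for real fault injection and are not Gremlin's or PagerDuty's tooling.

```python
import random
from typing import Callable


def run_experiment(
    hypothesis: str,
    inject_and_measure: Callable[[random.Random], float],
    threshold: float,
    trials: int = 5,
    seed: int = 42,
) -> dict:
    """Run the same experiment the same way each time and report the outcome."""
    rng = random.Random(seed)  # fixed seed keeps the simulated runs repeatable
    results = [inject_and_measure(rng) for _ in range(trials)]
    return {
        "hypothesis": hypothesis,
        "results": results,
        # The hypothesis holds only if every trial stayed within the threshold.
        "held": all(result <= threshold for result in results),
    }


def simulated_latency_ms(rng: random.Random) -> float:
    # Stand-in for injecting a fault and measuring latency (hypothetical numbers).
    return 200.0 + rng.uniform(0.0, 100.0)


report = run_experiment(
    "Latency stays under 400 ms when one replica is lost",
    simulated_latency_ms,
    threshold=400.0,
)
print(report["hypothesis"], "->", "held" if report["held"] else "did not hold")
```

Writing the hypothesis and threshold down before the run is the point of the exercise: the result is judged against a prediction made in advance rather than interpreted after the fact.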

Chaos Engineering Doesn’t Have to be Scary

Julie and Jason talk about concerns organizations can have around chaos engineering, and how chaos engineering doesn’t have to be scary when you implement the right methodologies.

Jason Yee: “With Failure Fridays and chaos engineering, you want to start small. So you want to start in your development environment, and with little bits of your components.”

Jason gives us advice on how to build up to staging environments and to production with chaos engineering.
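The "start small and build up" advice could be expressed as a simple gate, sketched below: an experiment only moves to the next environment after it has passed in the current one. The stage names follow the environments mentioned above, but the `next_stage` helper and the pass/fail rule are hypothetical and only meant to show the shape of the progression.

```python
# Environments ordered from smallest to largest potential impact.
STAGES = ["development", "staging", "production"]


def next_stage(current: str, experiment_passed: bool) -> str:
    """Return the environment where the next round of experiments should run."""
    if not experiment_passed:
        # Stay put until the experiment passes in the current environment.
        return current
    index = STAGES.index(current)
    # Advance one step at a time; production is the final stage.
    return STAGES[min(index + 1, len(STAGES) - 1)]


print(next_stage("development", experiment_passed=True))
print(next_stage("staging", experiment_passed=False))
```

The one-step-at-a-time rule is a deliberate design choice in this sketch: no experiment jumps from development straight to production.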

Additional Resources


Jason Yee

Jason Yee is Director of Advocacy at Gremlin where he helps people build more resilient systems by learning from how they fail. Previously, he was Senior Technical Evangelist at Datadog, a Community Manager for DevOps & Performance at O’Reilly Media, and a Software Engineer at MongoDB. Outside of work, he likes to spend his time collecting interesting regional whiskey and Pokémon.


Julie Gunderson

Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.