Alex Solomon (CTO and Co-Founder of PagerDuty) kicks us off with a definition of real-time operations and why it matters.
Alex: “Real-time operations to me, what that means is, it’s about dealing with problems and incidents and alerts in real-time. Making sure that the right people are pulled in whenever you have an issue with your production software, and only the right people. Those teams and individuals are looped in quickly, looped in via multiple channels to make sure they get there fast. Then once they are paged and looped in it’s about collaborations, it’s about communication, it’s about coordinations, it’s about defining clear roles for all the individuals and making sure they can collaborate and communicate effectively to make decisions quickly and resolve the underlying problems with those systems.”
Matt hops in to add that real-time operations also encompasses how we learn about incidents and how we continue to learn from them.
George talks about how real-time operations extends to every facet of online operations that might impact our team, whether it’s web services or code we write and how it operates in production, and how the definition of real-time operations is very broad.
Alex talks about the main myth he sees with real-time operations.
Alex: “The myth that you can buy a software platform like a PagerDuty or a DataDog or a New Relic or any of these toolboxes that we all have when running digital systems, and that buying the platforms will solve all your problems and be a silver bullet. In my experience what I see over and over is that yes you can buy the platform but the hard part is changing culture and transforming culture and transforming the way people work, and that comes down to people and process.”
Alex goes on to mention that it’s about the people supporting the services and full-service ownership.
Matt talks about the myth that we can prevent failure.
Matt: “The reality is we can do a lot to kind of steady ourselves and be ready to respond and take information we’ve already had, but our systems are so complex there’s no way to be fully predictive, and we need to understand how to make our systems - our socio-technical systems - more resilient rather than thinking if we just build in enough failover, enough automation, or write the best runbook ever, we’ll be able to prevent failure.”
The discussion moves toward designing systems for failure: accepting that problems will occur and having ways to detect and resolve them quickly.
The conversation moves to what we have each learned during our collective time at PagerDuty, whether it is the incident response process or postmortems.
Scott talks about how his time at PagerDuty has been entirely remote and how to be successful as a remote worker by being vocal about your wins, taking time for yourself and helping others learn about what you are doing by being an internal advocate.
George mentions that advocating internally and externally is about how you communicate with different folks that are distributed.
Julie discusses her experience with this being her first remote job and how the PagerDuty culture of having video on all the time makes being remote much easier by helping to build a great team relationship.
The conversation shifts to how real-time operations are impacted by the shift to remote work.
Alex discusses how, for the last 20-30 years, operations was about data centers and folks being on-site, but remote tools now make it easier for companies to move to remote work. However, the challenge and gap can be the culture of remote work if teams and companies aren’t used to that experience.
Julie talks about what it is like to work remotely with families in our homes while we work. She mentions how she packs her son a lunch just as she would if he were physically going to school.
Matt offers his story of how he has trained his kids to understand that he is working when he is at home.
Matt: “What I used to do is I used to wear a special baseball hat if I was going to be in the main room and it was like, if daddy had that hat on he was working, and for all practical purposes he was invisible, and that worked about half the time.”
Matt continues to talk about how we can be empathetic towards our co-workers and get to know them a little better.
Julie shares that her biggest learning at PagerDuty is:
Julie: “Every organization feels they have a very unique story to tell, but it’s not as unique as they may think. A lot of these organizations, they may have a different journey but they are still on kind of the same level as to what they deal with.”
Julie goes on to talk about how organizations are dealing with a lot of HybridOps situations.
George hops in to discuss how his background as a first responder applies to managing real-time operations:
George: “A lot of that comes down to preparedness, to having a plan, to knowing what you are going to do when those unexpected surprises come up.”
George adds that you cannot plan for everything, such as COVID-19, but you can build repetition and practice around how to respond when a given type of crisis occurs.
George: “Having a plan is not about following that plan to the letter, because we never know what we are going to expect. Real-time operations is completely unpredictable, but what is important is just knowing how you might approach a situation like something we can reasonably infer.”
The hosts talk about how practicing everything helps with times of uncertainty.
Alex shifts to discussing how HybridOps has been a big learning over his 11 years of building PagerDuty. He talks about how early on a lot of the customers were digital natives and cloud-first, and how they helped us in developing our product and vision. Alex mentions how HybridOps comes into play because some of these organizations have both legacy systems and newer digital systems; they also have both central operations and DevOps-oriented teams that build, run, and maintain their own systems.
Alex: “That’s what HybridOps is all about, it is the situation that these companies are in, that they need to operate in both modes at the same time, while working on modernizing their older applications.”
The episode wraps up by asking Matt the final two questions on his last Page it to the Limit episode.
Matt talks about how for the majority of his career he felt like his job was to defend production from DevOps but how that changed when he got into the DevOps mindset and changed his perception.
Matt closes by pointing out that he is really happy that in all the time he has worked for Alex he has never been asked to do anything with regular expressions.
Alex Solomon is the CTO and Co-Founder of PagerDuty. He is a passionate advocate for growing the community of PagerDuty practitioners by cultivating and sharing best practices that advance real-time operations.
Alex started PagerDuty in 2009 as founding CEO. He led the company through the first several stages of growth, from inception, product-market fit, multiple rounds of fundraising, building out the core functions of the company, and expansion of the product vision. He has served as a member of the PagerDuty board of directors since 2010.
Prior to PagerDuty, Alex was a software engineer at Amazon, where he built and maintained large-scale systems to help Amazon’s supply chain run efficiently and reliably. Alex graduated from the University of Waterloo with a B.S. in Software Engineering.
George Miranda is a Community Advocate at PagerDuty, where he helps people improve the ways they run software in production. He made a 20+ year career as a Web Operations engineer at a variety of small dotcoms and large enterprises by obsessively focusing on continuous improvement for people and systems. He now works with software vendors that create meaningful tools to solve prevalent IT industry problems.
George tackled distributed systems problems in the Finance and Entertainment industries before working with Buoyant, Chef Software, and PagerDuty. He’s a trained EMT and First Responder who geeks out on emergency response practices. He owns a home in the American Pacific Northwest, roams the world as a Nomad with his wife and dog, and loves writing speaker biographies that no one reads.
Matt Stratton is a DevOps Advocate at PagerDuty, where he helps dev and ops teams advance the practice of their craft and become more operationally mature. He collaborates with PagerDuty customers and industry thought leaders in the broader DevOps community, and back in the day, his license plate actually said “DevOps”.
Matt has over 20 years of experience in IT operations, ranging from large financial institutions such as JPMorganChase to internet firms, including Apartments.com. He is a sought-after speaker internationally, presenting at Agile, DevOps, and ITSM focused events, including ChefConf, DevOpsDays, Interop, PINK, and others worldwide. Matty is the founder and co-host of the popular Arrested DevOps podcast, as well as a global organizer of the DevOpsDays set of conferences.
He lives in Chicago and has three awesome kids, whom he loves just a little bit more than he loves Doctor Who. He is currently on a mission to discover the best phở in the world.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.
Scott McAllister is a Developer Advocate for PagerDuty. He has been building web applications in several industries for over a decade. Now he’s helping others learn about a wide range of web technologies. When he’s not coding, writing, or speaking, he enjoys long walks with his wife and skipping rocks with his kids, and is happy whenever Real Salt Lake can manage a win.