Real-Time Learnings With the PagerDuty Community Team

Posted on Tuesday, Apr 7, 2020
In this episode, the PagerDuty community team discusses what they have learned at PagerDuty and in the industry over the last 20 years.

Transcript

Julie Gunderson: Welcome to Page it to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. We have multiple show hosts for this episode. Today, I’m also joined by…

Alex Solomon: Alex Solomon, @alxsolomon on Twitter.

George Miranda: George Miranda, @gmiranda23 on Twitter.

Matt Stratton: Matt Stratton, @mattstratton on Twitter.

Scott McAllister: And Scott McAllister, @stmcallister on Twitter.

Julie Gunderson: Today, we’re going to talk about our learnings at PagerDuty and the industry over the years. We have the advocacy team with a combined 70 years of experience, and we want to share the main learnings over the years with all of our listeners. To get us started let’s frame the conversation by describing what realtime operations is and why it matters for organizations for anybody that’s new to this. Alex, do you want to kick us off?

Alex Solomon: Yeah, sure. Real time operations, to me what that means is it’s about dealing with problems and incidents and alerts in real time, making sure that the right people are pulled in whenever you have an issue with your production software and only the right people. Those teams and those individuals are looped in quickly, looped in via multiple channels to make sure that they get there fast. And then once they’re paged and looped in it’s about collaboration, it’s about communication, it’s about coordination. It’s about defining clear roles for all the individuals and making sure that they can collaborate and communicate effectively to make decisions quickly and resolve the underlying problem with those production systems.

Matt Stratton: That’s a really good point, Alex. And I think that’s a great definition. To me, just to sort of put another little bit of polish on that is it’s about moving away from queued work, right? It’s not just sort of taking tasks one right after the other, and then they have a sense of being done and then somebody else does their part. When I think about real time operations, it’s the fact that we are always in this mode, right? And that even extends to how we think about incidents and how we’re continually learning from them.

George Miranda: Yeah. And I think the thing that I would add to that is that it’s not just about moving away from some of those definitions, but it’s also moving away from the idea that it’s just web services, right? I think, especially on this podcast, we tend to talk about operating software in production and that’s the thing that we center around. But I think generally speaking, we think about the services that our teams provide, right? The code that we ship in production and making sure those things are available, but this extends to every facet of online operations that might impact our team, right? Whether that’s the web services that we provide and the code that we’re writing and how that runs in production or production systems that we rely on, right? Maybe a CRM, if you’re a sales team or maybe an outreach tool, if you’re a marketing team. I think the definition is pretty broad and that’s something to keep in mind as well. But on that note, I think there are a number of misconceptions that folks might have about real-time operations. So on that note, the thing that we typically do in this show is talk about myths that we would like to debunk. So this one, Alex, since we haven’t had you here before, I’m going to pick on you and ask, what myth do you think you would like to debunk about real-time operations?

Alex Solomon: Sure. I can give you an easy one that I see a lot when working with customers, especially larger enterprise customers. The myth that you can buy a software platform like a PagerDuty or a Datadog or a New Relic or any of these toolboxes that we all have when we’re running digital systems and that buying the platform will solve all of your problems, it will be your silver bullet. In my experience, what I see over and over is that yes, you can buy the platform, but the hard part is changing culture and transforming culture and transforming the way people work and that comes down to people and process. So in our case, it’s a lot about transitioning to a DevOps-oriented culture. It’s about the concept of full service ownership and ensuring that the engineers who are writing production systems are also responsible for supporting those systems, which inevitably means that those folks will go on call. The devs will go on call directly and when something breaks they’ll get paged for the systems that they own.

Matt Stratton: I think another really popular misconception is that with enough information, we can be totally predictive, right? And this is where hindsight bias tends to bite us when we’re thinking about learning from an incident, which is when we look back at things we say, “If we had enough information, we could have predicted this failure,” and we can then guard ourselves against failure and that will always happen. And the reality is we can do a lot to kind of steady ourselves and be ready to respond and take information we’ve already had, but our systems are so complex, right? That there’s no way to be fully predictive. And we need to understand how we can make our sociotechnical systems more resilient rather than thinking if we just build in enough failover, or enough automation or write the best possible runbook ever we’ll be able to prevent failure. So I think a big myth is that we can prevent any kind of failure as opposed to being able to be resilient and respond and minimize and mitigate the impact of that failure and learn from it.

Alex Solomon: Yeah. To that end, Mattie, I think in my mind, I correlate failure to changes. If you’re taking a production system and you’re changing it all the time, you’re making deploys, you’re changing config, you’re scaling up, you’re evolving the system on a daily or weekly basis, then with that change will come some failure because you won’t be able to predict everything ahead of time and prevent all of these failures. But the ideal situation is that your systems are designed for failure. You have canary builds, you have ways to detect problems and rectify them quickly so that failure does happen, but it is small and the impact is small when it does happen.

Julie Gunderson: Thank you both. So with all of us on here, let’s talk about what we’ve learned at our time at PagerDuty that we want to share. Mattie, do you want to go ahead and start us off?

Matt Stratton: I got to say that, I think when I look back at the things that I’ve learned over my time at PagerDuty, one of the most impactful really is leveraging the Incident Command System for incident response. It’s something I was a little aware of before I came to PagerDuty, but it’s something we believe in really strongly here. We’ve written about it. You could go to response.pagerduty.com to read all about it. And it’s really such an impactful way to think about when you’re in that mode of responding to an incident. And it’s had a major impact in how I think about response really and I think George might be chuckling, not to put words in his mouth, because he has a background in this before, so he didn’t necessarily learn it. But for me, someone who doesn’t have a background in emergency response like that, it was really educational and transformative in a lot of ways to how I think about incidents.

Scott McAllister: And for me, I’ll switch tracks a little bit. My time at PagerDuty has been entirely remote. And as the world right now is going increasingly remote, I think it’s apt to think about how to be successful as a remote worker. For me, I’ve learned that you need to take breaks. I need to get out. I actually schedule time for walks every morning and every afternoon just to get out so I can get my head in the right spot and also so that I can keep my body moving. But also a key thing that I learned between my current remote job and a job I had before, where I was remote, was the need for self-promotion. Now, for myself, I’m not the kind of person who sits there and says, “Look at me, I’m so amazing. Look at my stuff I’ve done.” But when you’re remote, you don’t really see people around you very often. You kind of need to be loud about the things that you accomplish and the tasks that you complete and the good feedback you get from customers and things of that nature. So I think that’s something to keep in mind as a remote worker. And finally, as an advocate, I have a different focus than the rest of you on this team. You all talk about DevOps thought leadership and things, and I focus more on developers and how developers interact with PagerDuty. And a lot of times people think that advocacy is all about conferences and speaking and traveling all over the world. While that is a part of the job, especially as a developer advocate, when you’re working with an API and building apps, there is another part, and that’s to be an internal advocate, someone who advocates for the API, who works with the teams internally as we build APIs to work with the world as we’re connecting with each other and staying real time and putting information where people are already working.
It’s good to have an opinion of someone who’s seeing the rest of the industry and how the customers and partners are doing with our API and taking that back to our internal product teams with the API. And it’s been fun to watch the differences that we’re making on our APIs today at PagerDuty. I think we have great APIs, but I think we can always improve. And we’ve been making great improvements through the conversations we’ve been having, taking the stuff that I’ve been learning with our customers and with the partner developers and sharing it with our API teams. And so those are a couple of things that I’ve learned in my time at PagerDuty.

George Miranda: Scott, you put a lot of stuff out there in that one chunk, and one of the things that I really like about that is that it’s basically two sides of the same coin, right? You were talking about advocating internally to the company and externally to developers and at the same time, that’s kind of what you were saying about remote work, right? It’s really about how you communicate in different scenarios as folks are distributed. So I wanted to ask you, is PagerDuty the first time that you had a fully distributed team? Or when you’ve worked remotely in the past, has it been you being remote and a couple of other folks that are in the office? How has that usually played out for you?

Scott McAllister: So my previous work experience where I was working remote, I had an office, I live in Western Washington outside of Seattle. And so there was an office that our company had in Bellevue, but nobody on my team was in that office. And so my managers would either be on the East coast, or… I think one time I had a manager in Montana. We were pretty distributed as it worked. And so it wasn’t necessarily a situation like we have here at PagerDuty where a lot of the company is based out of offices and then our team is remote. So I guess to put a long answer on that question. Yeah, it was kind of a… My team was remote, but we had an office.

Julie Gunderson: For me, this is actually my very first remote job. In my last role where I was kind of on a team by myself, I still sat with the engineers, I still had that social aspect, even though I did something completely different. And I will say the thing that makes it the easiest is our culture of video on all the time. I love that. It’s hard for me when I join calls now and there are people that are not on video because it at least gives you that sense of community and interaction. And with our team being fully distributed, I think it still helps us have that sense of actual team.

George Miranda: I’m going to put out one little bit of trivia for folks that are listening to this podcast. In terms of video on, that’s one of the things that we do when we record this podcast. Even though we don’t provide video, we’re all looking at each other so we can pick up on some of those bits of communication and inflection and tone that you might miss otherwise when you don’t get to see folks. And I think that’s one of those details about remote work that often gets overlooked. So I’m really glad you pointed that out, Julie.

Scott McAllister: Right. Julie, I agree completely. I think our team at PagerDuty, especially our community and advocacy team have a great relationship with each other because we connect over that video chat every day and we make that connection as a team. So, I feel like I have a good relationship with each member of the team. It’s almost like I know you almost, maybe even better than I would if I worked in an office with you. I don’t know.

George Miranda: And on that note, we’re going to talk about something super timely, which is the shift to remote work. And I think this is something that’s affecting a lot of folks as we go through the COVID-19 crisis. And many folks are now starting to work from home, many teams that haven’t been remote in the past are now working remote. So maybe we can talk a little bit about how real-time operations is impacted by that shift to remote. Alex, maybe do you want to kick off that discussion? Any thoughts there?

Alex Solomon: I think on the technical side, on the engineering side, on the operation side, we are fortunate in that we’re a bit ahead of the curve. That we’ve been… If you think back to the last 20, 30 years, we started with data center environments and with folks needing to be on site to folks needing to be in the office on the corporate network. Now, more recently we’ve had remote access tools and VPNs, and then there’s more digitally native companies that don’t even have a real office network anymore and everything’s in the cloud. And PagerDuty as a company we’re on that latter stage where we are fortunate to be one of the digital natives, but even if you’re in that previous generation, we have the tools to VPN and to work remotely from anywhere. And with the advent of mobile as well, now we have apps for all of our critical business applications like Salesforce and reporting applications and things like ADP for benefits. So, we’re in a good spot where we have the tools necessary to work remotely. I think the biggest challenge and the biggest gap for us is going to be around the culture of remote work. Because if you don’t have teams that are used to that, if you don’t have individuals who are used to that, it’s going to be challenging for them. I could speak for myself. I do like the routine of coming into the office every day and having that commute time to prepare for your day and having that separation between work life and home life. And that’s mostly gone away and it’s been a little bit of an adjustment for me personally. And I can think of other folks who now with schools being canceled, all their kids are at home as well, and things can get pretty crowded. I don’t have any kids yet, but some of the parents on the call I’d love if you guys can chime in and talk a little bit about that aspect of working remotely.

Julie Gunderson: I have to say that I think that it’s going to be interesting the changes that we see in the world as to how we integrate our families into our work lives based on this, because it is a complete new world for a lot of people of having their kids at home with them. I’m really lucky because I have a 16 year old who will be very happy to watch YouTube videos on Minecraft all day long. But I did something today as his school got canceled today and I packed him his lunch. So I packed his lunch, I put it in the fridge for him so that he didn’t interrupt me throughout the day. He also knows that when my door is shut, he’s got to have some separation. That’s a little bit different for me. Mattie, you’ve got some little ones. How does that look for you?

Matt Stratton: A lot of it is… I’m fortunate, and I think there’s something that bears mentioning is that a lot of folks who are in a remote situation because of COVID-19. This is not normal remote work, what most people are handling right now, because it’s everybody all at once. But I’ve been a remote worker for very many years and so my kids grew up with this. In fact, I don’t think my kids would even understand what it would mean for me to go into an office. But a lot of that is training around either, if you have the ability to have a place with a door, that’s great. Then you can say, “Hey, when the door is shut, I’m working. It’s not just because I’m home, I’m 100% available.” But you don’t always have that. So a trick that I’ve done is… And my kids now have a little more cognizant awareness of when I’m doing certain things and they’re a little more respectful of interrupting. But what I used to do is I used to wear a special baseball hat if I was going to be in the main room. And it was like, if daddy had that hat on, he was working and for all practical purposes, I was invisible. And that worked about half the time, but that’s better than none of the time. And I think the other thing just to keep in mind while you normally wouldn’t have children interrupting you when you’re sitting in the office at the workspace or whatever, that’s a little bit of the reality of that. And same thing with cats and dogs and things like that. And that’s just part of the remote work experience and being aware of that with your colleagues, especially when they’re in a non-ideal remote situation, and I think it is a place we can have a lot of empathy for, right? And it also makes things a little more personal. You get to know your coworkers a little bit better if their dog is jumping up in frame all the time so long as it’s not distracting. 
I think that’s the other thing too when I think about how our real-time work is changing, folks are maybe used to being able to walk around the office and just go hit up somebody and ask for an update or things like that. That’s a little bit of that culture change that Alex was talking about. Besides the remote stuff though, Julie, what else have you learned in your time at PagerDuty that you’d love to share?

Julie Gunderson: The biggest thing for me and talking to all of these companies and hearing talks at conferences, but more so what keeps getting brought up over and over in meetings is that every organization feels they’ve got a very unique story to tell, but it’s not as unique as they may think. A lot of these organizations, they may have a different journey, but they’re still on kind of the same level as to what they deal with. One of the big things is HybridOps and that’s a thing where you’ve got organizations that grew up with this legacy and monolithic systems, and then they’re building things now the DevOps way, and people really struggle with communicating with each other in those organizations and how do they do that? And how do they share information? I would say that that’s really my biggest learning so far.

George Miranda: Yeah. I would echo that. And talking to customers, I think a lot of folks are dealing with inherent challenges. And I think as an IT industry a lot of us are dealing with the same problems and we don’t always realize it. Not to go back to remote work, but for example, that shift to remote work, right? Suddenly everybody’s talking about this and I think it’s the same when it comes to real-time operations. This is a problem that everyone is having. And many organizations are still trying to wrap their minds around how that works. So some of these stories are, like Julie said, maybe not as unique as you think they are. I think I’m going to switch gears a little bit and say that the thing that I learned… Matt called it out a little bit earlier, so he’s right. I do have a background as a first responder. And one of the things that has really floored me is just how much of the things you learn as a first responder apply to managing real time operations. And a lot of that comes down to preparedness, right? To having a plan, to knowing what you’re going to do when those unexpected surprises come up. I mean, and again, somewhat timely the topics we’ve been talking about, COVID-19 for example, right? When these things occur, you need to have a plan for how you’re going to respond, what happens if you can’t come into the office, how things are going to shift, right? And I think a lot of us were caught off guard, but the thing that I understand about having a plan is you cannot possibly plan for all of these things. There’s no way that a year ago we would know that this is what we would be going through in Q1 of this year, right? But what you can have is you can have repetition and practice around what happens when a type of crisis occurs, right? Who responds? Who’s going to take point on figuring out what the official response is from a company? Who’s going to take point on internal communications?
Who’s going to take point on figuring out just the landscape of things that we haven’t thought about in this type of crisis? And again, from an incident commander perspective, who is going to just make sure that the process is moving along effectively? And so having a plan is not about following that plan to the letter, because we never know what we’re going to expect. Real time operations is completely unpredictable. But what is important is just knowing how you might approach a situation like something we could reasonably infer, right? We might have some reason that a thing occurs and an office gets shut down, whether that’s a natural disaster or this big superbug, an epidemic that came out of nowhere, how do we respond to things like that? Right? You can’t plan every detail and you don’t have to, you just have to have a general idea of what you would do so that in the moment you work spontaneously and you work quickly to get a resolution.

Julie Gunderson: George, one of the things that I love that you said there was about practicing, and we’ve done a couple of episodes on chaos engineering, but practicing in the terms of what we talk about at PagerDuty with incident response, if you have an incident it’s a P1 or a SEV1, and it gets downgraded. You treat it like it’s a real incident so that you can practice and doing that with your postmortems. And I think that you’re right, we can’t predict everything, but taking every opportunity to practice what we do know will help us in those times of uncertainty.

George Miranda: Absolutely 100%.

Alex Solomon: For myself, I’ve learned many, many, many things in the last 11 years of starting and building PagerDuty. So I’m going to pick one thing that really stands out and it relates to what Julie was talking about earlier around HybridOps. Early on in our journey, we started with a lot of customers that were digital natives like ourselves. They were cloud-first. They had a pretty homogeneous environment. They had adopted DevOps and SRE practices, and they became great customers for us because they had software and production systems that they needed to run. They were pretty effective at managing their production systems and setting up on-call schedules and response processes. And they helped us in developing our product and our vision early on. And then over time we started getting larger enterprise customers, and that’s where HybridOps comes in, because a lot of the enterprises we work with they’ve been around for many decades. They have a lot of on-prem systems. They have their own data centers. They have legacy systems that they have to maintain sometimes even on mainframes, but they also have the newer digital systems that they’ve been building with a cloud-first containerized architecture. So they have kind of both. They have the legacy systems and they have the new systems. They have central operations and an operation center that’s run 24/7, but they also have teams that are DevOps-oriented that build and run and maintain their own systems. That’s what HybridOps is all about. It is the situation that these companies are in that they need to operate in both modes at the same time while working on modernizing their older applications. And that’s a big challenge that a lot of our customers are dealing with. And it’s also a cultural challenge because you have more traditional operations culture, co-existing with more DevOps, nimble agile culture at the same time. So how do you make those teams talk to each other?
How do you avoid siloing and how do you avoid the them-versus-us type of problem and kind of create some cohesiveness and some transparency and some visibility across those two modes of working? And that’s a challenge that we’re working with a lot of customers on. We’re advising them on how to move stuff to the cloud, how to transition, how to modernize their operations and improve their operational maturity while coexisting and understanding that this kind of a transition is not going to happen overnight. It’s going to take potentially many years for it to happen. On the show, there are two things we ask every guest. Mattie, since this is your last show as a PagerDuty employee, we’re going to ask you the questions. The first question is, what is one thing you wish you would have known sooner when it comes to running software in production?

Matt Stratton: That’s great. I think about this a lot, this particular topic that I’m about to share, and it might come as a surprise for those of you who know me for the last seven, eight years of being such a proponent of the DevOps movement and everything. But the majority of my career as a sysadmin, I felt like my job was to protect the company from the developers, right? So I thought that my big job was that developers were always trying to mess stuff up and my goal was to play total blue team against them and figure out all the ways they were going to try to get around all my safeguards and defend production from developers. And if I had gotten my mind in this more DevOps mode sooner, not only would I probably have been more effective for the organizations I worked with, I probably would have had more friends among software engineers.

Scott McAllister: Yeah. You and I wouldn’t have been friends at that time of your life.

Matt Stratton: No, no, no.

Alex Solomon: And last question, is there anything about running software in production that you’re glad I never asked you during your time at PagerDuty?

Matt Stratton: I am really happy that in all the time I’ve worked for you, Alex, you have never asked me to do anything with regular expressions because I can do it, but it’s ugly and it’s slow and I have to cheat a lot.

George Miranda: I can’t believe that regex is such a thing for you Mattie.

Matt Stratton: Don’t tell anybody.

George Miranda: How are you supposed to swing in on a rope and save the day?

Matt Stratton: Well, because I have all the nice little cheater things. Regex replaced subnetting for me as the thing that I had to cheat at, right? As a network engineer, everybody thought I was really good at subnetting, but it was because I had cheats and regex is the same way. So don’t ask me about subnetting IPv4 and don’t ask me about regular expressions and you will think I’m awesome.

Julie Gunderson: Well with that, we want to thank everybody for taking the time to listen to us today. If you want to know more about some of the things that we talked about, you can go to our ops guides. We talked about response.pagerduty.com today and postmortems.pagerduty.com. There’ll be links in the notes. Thank you for your time.

Scott McAllister: And this is Scott McAllister.

Matt Stratton: This is Matt Stratton.

George Miranda: This is George Miranda.

Alex Solomon: This is Alex Solomon.

Julie Gunderson: And this is Julie Gunderson wishing you an uneventful day. That does it for another installment of Page it to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit using the number two. That’s @pageit2thelimit. Let us know what you think of this show. Thank you so much for joining us, and remember uneventful days are beautiful days.

Show Notes

Why real-time operations matters:

Alex Solomon (CTO and Co-Founder of PagerDuty) kicks us off with a definition of real-time operations and why it matters.

Alex: “Real-time operations to me, what that means is, it’s about dealing with problems and incidents and alerts in real-time. Making sure that the right people are pulled in whenever you have an issue with your production software, and only the right people. Those teams and individuals are looped in quickly, looped in via multiple channels to make sure they get there fast. Then once they are paged and looped in it’s about collaboration, it’s about communication, it’s about coordination, it’s about defining clear roles for all the individuals and making sure they can collaborate and communicate effectively to make decisions quickly and resolve the underlying problems with those systems.”

Matt hops in to discuss that real-time operations also encompasses how we learn about incidents and how we continue to learn from them.

George talks about how real-time operations extends to every facet of online operations that might impact our team, whether it’s web services or code we write and how it operates in production, and how the definition of real-time operations is very broad.

The Myth of Real-Time Operations

Alex talks about the main myth he sees with real-time operations.

Alex: “The myth that you can buy a software platform like a PagerDuty or a Datadog or a New Relic or any of these toolboxes that we all have when running digital systems, and that buying the platforms will solve all your problems and be a silver bullet. In my experience what I see over and over is that yes you can buy the platform but the hard part is changing culture and transforming culture and transforming the way people work, and that comes down to people and process.”

Alex goes on to mention that it’s about the people supporting the services and full-service ownership.

Matt talks about the myth that we can prevent failure.

Matt: “The reality is we can do a lot to kind of steady ourselves and be ready to respond and take information we’ve already had, but our systems are so complex there’s no way to be fully predictive, and we need to understand how to make our systems - our socio-technical systems - more resilient rather than thinking if we just build in enough failover, enough automation, or write the best runbook ever, we’ll be able to prevent failure.”

The discussion moves towards how systems should be designed for failure: with canary builds and ways to detect problems and rectify them quickly, failure still happens, but it is small and its impact is small.
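
As a rough illustration of the "small failure, small impact" idea behind canary builds, here is a minimal Python sketch of a canary check. It is not something described in the episode: the names `ReleaseStats` and `canary_decision`, and the default thresholds, are hypothetical and chosen only for the example. The sketch compares the error rate of a canary release against the current baseline and decides whether to wait, roll back, or promote.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    """Request and error counts observed for one version (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # A version that has served no traffic has no observed errors.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: ReleaseStats,
                    canary: ReleaseStats,
                    max_increase: float = 0.01,
                    min_requests: int = 100) -> str:
    """Return "wait", "rollback", or "promote" for a canary release.

    The thresholds are illustrative assumptions, not recommended values.
    """
    # Too little canary traffic to judge: keep the canary small and wait.
    if canary.requests < min_requests:
        return "wait"
    # The canary fails noticeably more often than the baseline: roll back
    # while only a small share of traffic has been affected.
    if canary.error_rate - baseline.error_rate > max_increase:
        return "rollback"
    # Otherwise the canary looks no worse than the baseline.
    return "promote"
```

For instance, with a baseline of 10,000 requests and 50 errors, a canary with 500 requests and 25 errors would be rolled back, while a canary with 500 requests and 3 errors would be promoted. A real rollout would normally look at more signals than error rate alone.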

Sharing What We Have Learned at PagerDuty

The conversation moves to what we have each learned during our collective time at PagerDuty, whether it is the incident response process or postmortems.

Scott talks about how his time at PagerDuty has been entirely remote and how to be successful as a remote worker by being vocal about your wins, taking time for yourself and helping others learn about what you are doing by being an internal advocate.

George mentions that advocating internally and externally is about how you communicate with different folks that are distributed.

Julie discusses her experience with this being her first remote job and how the PagerDuty culture of having video on all the time makes being remote much easier by helping to build a great team relationship.

The Shift to Remote Work

The conversation shifts to how real-time operations are impacted by the shift to remote work.

Alex discusses how, over the last 20-30 years, operations centered on data centers with folks on-site, but remote tools now make it easier for companies to move to remote work. However, the challenge and gap can be the culture of remote work if teams and companies aren’t used to that experience.

Julie talks about what it is like to work remotely with families in our homes while we work. She mentions how she packs her son a lunch like she would have if he was physically going to school.

Matt offers his story of how he has trained his kids to understand that he is working when he is at home.

Matt: “What I used to do is I used to wear a special baseball hat if I was going to be in the main room and it was like, if daddy had that hat on he was working, and for all practical purposes he was invisible, and that worked about half the time.”

Matt continues to talk about how we can be empathetic towards our co-workers and get to know them a little better.

Julie shares her biggest learning at PagerDuty:

Julie: “Every organization feels they have a very unique story to tell, but it’s not as unique as they may think. A lot of these organizations, they may have a different journey but they are still on kind of the same level as to what they deal with.”

Julie goes on to talk about how organizations are dealing with a lot of HybridOps situations.

George hops in to discuss how his background as a first responder applies to managing real-time operations:

George: “A lot of that comes down to preparedness, to having a plan, to knowing what you are going to do when those unexpected surprises come up.”

George adds that you cannot plan for everything, such as COVID-19, but you can build repetition and practice around how you respond when a given type of crisis occurs.

George: “Having a plan is not about following that plan to the letter, because we never know what we are going to expect. Real-time operations is completely unpredictable, but what is important is just knowing how you might approach a situation like something we can reasonably infer.”

The hosts talk about how practicing everything helps with times of uncertainty.

HybridOps

Alex shifts to discussing how HybridOps has been a big learning over his 11 years of building PagerDuty. He talks about how early on a lot of the customers were digital natives and cloud-first, and how they helped us in developing our product and vision. Alex mentions how HybridOps comes into play because some of these organizations have both legacy systems and newer digital systems. They also have both central operations teams and DevOps-oriented teams that build, run, and maintain their own systems.

Alex: “That’s what HybridOps is all about, it is the situation that these companies are in, that they need to operate in both modes at the same time, while working on modernizing their older applications.”

Wrap Up

The episode wraps up by asking Matt the final two questions on his last Page it to the Limit episode.

Matt talks about how, for the majority of his career, he felt his job was to defend production from DevOps, and how his perception changed once he got into the DevOps mindset.

Matt closes by pointing out that he is really happy that in all the time he has worked for Alex he has never been asked to do anything with regular expressions.

Additional Resources

Hosts

Alex Solomon

Alex Solomon is the CTO and Co-Founder of PagerDuty. He is a passionate advocate for growing the community of PagerDuty practitioners by cultivating and sharing best practices that advance real-time operations.

Alex started PagerDuty in 2009 as founding CEO. He led the company through the first several stages of growth, from inception, product-market fit, multiple rounds of fundraising, building out the core functions of the company, and expansion of the product vision. He has served as a member of the PagerDuty board of directors since 2010.

Prior to PagerDuty, Alex was a software engineer at Amazon, where he built and maintained large-scale systems to help Amazon’s supply chain run efficiently and reliably. Alex graduated from the University of Waterloo with a B.S. in Software Engineering.

George Miranda

George Miranda is a Community Advocate at PagerDuty, where he helps people improve the ways they run software in production. He made a 20+ year career as a Web Operations engineer at a variety of small dotcoms and large enterprises by obsessively focusing on continuous improvement for people and systems. He now works with software vendors that create meaningful tools to solve prevalent IT industry problems.

George tackled distributed systems problems in the Finance and Entertainment industries before working with Buoyant, Chef Software, and PagerDuty. He’s a trained EMT and First Responder who geeks out on emergency response practices. He owns a home in the American Pacific Northwest, roams the world as a Nomad with his wife and dog, and loves writing speaker biographies that no one reads.

Matt Stratton

Matt Stratton (He/Him)

Matt Stratton is a DevOps Advocate at PagerDuty, where he helps dev and ops teams advance the practice of their craft and become more operationally mature. He collaborates with PagerDuty customers and industry thought leaders in the broader DevOps community, and back in the day, his license plate actually said “DevOps”.

Matt has over 20 years of experience in IT operations, ranging from large financial institutions such as JPMorganChase to internet firms, including Apartments.com. He is a sought-after speaker internationally, presenting at Agile, DevOps, and ITSM focused events, including ChefConf, DevOpsDays, Interop, PINK, and others worldwide. Matty is the founder and co-host of the popular Arrested DevOps podcast, as well as a global organizer of the DevOpsDays set of conferences.

He lives in Chicago and has three awesome kids, whom he loves just a little bit more than he loves Doctor Who. He is currently on a mission to discover the best phở in the world.

Julie Gunderson

Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.

Scott McAllister

Scott McAllister is a Developer Advocate for PagerDuty. He has been building web applications in several industries for over a decade. Now he’s helping others learn about a wide range of software-related technologies. When he’s not coding, writing or speaking he enjoys long walks with his wife, skipping rocks with his kids, and is happy whenever Real Salt Lake, Seattle Sounders FC, Manchester City, St. Louis Cardinals, Seattle Mariners, Chicago Bulls, Seattle Storm, Seattle Seahawks, OL Reign FC, St. Louis Blues, Seattle Kraken, Barcelona, Fiorentina, Juventus, Borussia Dortmund or Mainz 05 can manage a win.