Quintessence Anx: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host Quintessence, or @QuintessenceAnx on Twitter. Today we’re going to be discussing hypercare. Hypercare in this context is the state of elevated support that is required to maintain system availability and performance during episodes of expected surges in traffic. We are joined today by our guest panel from The New York Times: Megan Araula, Lead Software Engineer and Election Readiness Technical Lead; Alexandra Shaheen, Program Manager and Election Readiness Program Lead; and Vinessa Wan, Technical Product Manager and Election Readiness Emeritus. Thank you all for joining us today and welcome to the show.
Vinessa Wan: Thank you.
Quintessence Anx: As our first order of business, can you each tell us a little bit about what situations warrant hypercare at The New York Times?
Vinessa Wan: At The New York Times, we always have some major temporary events that we want to prep for, things like the Olympics or Supreme Court hearings. I would say overall, though, that elections, especially presidential elections, are critical events for us where we really implement hypercare. They’re pretty much like our Super Bowl, and they tend to be really large drivers of traffic to our site. We’ve seen steady increases in traffic from 2016 to 2018 to now. Elections in general are always going to be major events for us to practice hypercare.
Quintessence Anx: Gotcha. Thank you so much. And kind of piggybacking off of that, what are some myths that you encounter when you’re implementing hypercare?
Vinessa Wan: To start off, there’s a couple of things that come to mind. The first myth is that we know exactly what we’re prepping for, particularly in terms of what could happen, whether that is the scenario or the traffic level. Teams frequently ask us, “What is that level of traffic?” And if we knew, I think we would be in a much different scenario, because these things are impossible to predict. And especially this year, we were not just dealing with an election night, we were really looking at more of an election week. In addition to dealing with the pandemic, this presented a lot of unknowns and things that could play out. The fact is you really have to be ready for all of them.
Vinessa Wan: And I’d say also, particularly with this year, we traditionally had folks on-site and we would have these really large rooms, kind of like war rooms, where people would come together in person to make sure we were monitoring the site. This year really debunked the myth that you had to be on-site. Obviously, because of the circumstances with the pandemic, we had to really think differently about event support, and we were able to optimize from the situation. And I think it was actually more to our benefit.
Quintessence Anx: Gotcha. And Megan and Alexandra what do you think about myths that you’ve encountered?
Megan Araula: A myth that we encountered is that you need a long code freeze to be successful when looking at how to regulate code practices in a hypercare situation. Past practices advocated for a long code freeze, and we didn’t agree with this, in that we felt it was important for system leads to have autonomy and own their risk. Additionally, we felt that long freezes promoted other types of risk. We therefore elected to restrict deploys to production for a shorter amount of time, specifically one day, than we had instituted in the past. This really worked well for us because it let system owners deploy code and not be stuck in this weird situation where they have piles and piles of code in the backlog until we lifted that code freeze.
Alexandra Shaheen: I would say another myth is that it’s easy to get resilience work prioritized and folks excited to work on it. The first phase of the Election Readiness project was an assessment phase where we looked for vulnerabilities within our tech stack. People have worked hard on these systems for years now, and we were tasked with providing input on how these systems could be improved. Identified resilience opportunities are not always in line with the work a team has planned on doing in a given year, and we interrupted a few road maps in ways that weren’t always celebrated. That said, the work that we did was worthwhile. Additionally, business continuity projects are not always exciting. There’s a good chance that what you build is not often, if ever, used. Building something for a just-in-case scenario isn’t always as exciting as building a feature that will be used every day. And that’s what made this project a little more challenging than others.
Quintessence Anx: That makes sense. Thank you all so much. New York Times has been around for a while, elections have been around for a while but how long have you specifically been doing hypercare preparedness?
Vinessa Wan: I’d say we’ve always done some version of hypercare to prep for elections, but where I’ve been involved, I’ve seen even more of an effort that really started in 2016 and then continued with the 2018 midterms as well. Normally we wouldn’t need to put in that kind of prep for midterm elections, but given the political landscape, we were expecting 2018 to have record traffic, which it did. I’d say that this year was the biggest effort. We really intensified with 2020 because our digital subscribers have been growing so much, we were already receiving a lot of traffic with the pandemic, and there was a lot of excitement around the election overall. I think Megan and Alex could really speak to this: we put in a lot of effort for 2020 that normally we might not have been putting in.
Vinessa Wan: This year was also the first time we really looked at it differently. Normally we would just look at some systems and do some prep. This project ran over the course of the year, and we started not by looking first at our systems, but by looking at the business workflows, or what I think PagerDuty refers to as business services. We started the conversation with what are the things our business, our readers, and our newsroom care about the most, and we tiered them according to criticality. Then we looked at the systems afterwards, and that really provided a foundation for not just how we wanted to look at 2020 but how we want to look at our user experiences and our systems going forward.
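As a rough illustration of the approach described here (tier the business workflows first, then look at the systems behind them), the following Python sketch derives a criticality tier for each system from the services it supports. The service names, system names, and tier numbers are all hypothetical and are not taken from The New York Times.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessService:
    """A business workflow that readers or the newsroom care about."""
    name: str
    tier: int  # 1 = most critical (hypothetical scale)
    systems: List[str] = field(default_factory=list)


def systems_by_tier(services: List[BusinessService]) -> Dict[str, int]:
    """Give each system the most critical (lowest) tier of any service it supports."""
    result: Dict[str, int] = {}
    for service in services:
        for system in service.systems:
            result[system] = min(service.tier, result.get(system, service.tier))
    return result


# Hypothetical examples, for illustration only.
services = [
    BusinessService("Read the home page", tier=1, systems=["cdn", "homepage-api"]),
    BusinessService("Publish an article", tier=1, systems=["cms", "publishing-pipeline"]),
    BusinessService("Comment on an article", tier=3, systems=["comments-api", "cdn"]),
]

tiers = systems_by_tier(services)
```

A shared system such as the hypothetical `cdn` inherits tier 1 because a tier 1 workflow depends on it, even though a tier 3 workflow uses it too.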
Quintessence Anx: Gotcha. And that makes sense. Can you tell us a little about some of the pain points you’ve experienced over time as you’ve been iterating over these processes?
Alexandra Shaheen: Yeah. Gaining visibility and understanding of all the technology teams within The Times was really tough. This is a big technology ecosystem. Generally the three of us are situated within the umbrella of Product Engineering. However, we needed to build relationships with other technology factions within The Times, including print technology and interactive news. We tried to bridge the gap and communicate between these groups as transparently as possible throughout this entire project. One example of this was the UX alignment meetings that we pioneered with this Election Readiness project. These meetings were intended to get multiple groups (interactive news, growth, brand, and the general technology group) on board with one another’s plans for site UX for major election events. These transparent meetings enabled all of technology to see the UX plans of these disparate groups and form a cohesive understanding of what our site would look like on Super Tuesday and the general election.
Vinessa Wan: Yeah. And Quintessence, to really build on what Megan said, this was a really big effort. And I would say it was still a lot of work to make sure that this was not about critiquing a design but about making sure that anybody working on elections could understand what our expected user experiences were and ask questions more like, “Hey, if this doesn’t work, what is acceptable?” It was about getting everybody to understand what to expect and being open and transparent. Even though The New York Times is a large company with so many different teams, I think this is a lesson that a company of any size should be able to apply: having that dress rehearsal and understanding what to expect.
Megan Araula: Yeah. One thing that was really important with the election was making sure our site won’t go down and we didn’t know, yeah.
Alexandra Shaheen: Yeah.
Vinessa Wan: I think my mom was going to listen to this and that’s usually sometimes how I have to explain my job.
Megan Araula: Yeah. And the problem with this election is we didn’t know how much traffic we were going to get; all we could do was guess. The way we prepped for that was by stress testing multiple applications. There were 10 or more applications being stress tested simultaneously. This was something new in the organization. We’d never done this before, and it was fairly difficult because it involved actually configuring the technology to apply the artificial load as well as inventing a process for doing so at scale. We had to advocate for the numbers we were shooting for, and these were insanely large numbers. Many folks doubted these numbers. In the end, the numbers we projected actually happened on election night, which was great. It was amazing.
Megan Araula: But we needed to negotiate with product and business to get permission to break production whenever we did this. We hit this hurdle where people asked, “Are you sure that we’re going to reach this threshold?” We overcame it by looking at historical data. How did we do last time? And what should we project? And we really told everybody what these numbers were, and how and why we got them. We also got buy-in from management, because we needed their help to let other people know why we predicted these numbers.
Vinessa Wan: Yeah. There’s also a big lesson here, because some folks may not be as familiar with stress testing versus typical load testing. Especially when you’re covering major news events, and news can happen at any time, you have to be able to conduct a stress test where you are saying, we actually want to break things because we want to understand where the vulnerabilities are. And then also on the other side, you want to see how people react when things go wrong, because things are always going to go wrong. It is a really large trust exercise with everyone in the organization, because it can be a really scary thing. And there were definitely some points during the year when we were stress testing that we had a lot of major news events. This was again something that I think really goes to the credit of Alex and Megan. I always remember being in a room and really explaining that we were going to do this. And then also doing these things while everyone is working remotely was yet another hurdle.
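To make the idea of projecting a target from historical data concrete, here is a minimal Python sketch. It assumes one simple method (average event-over-event growth, plus a safety multiplier); the panel does not say how their projection was actually calculated, and the function name, the `headroom` parameter, and every number below are made up for illustration.

```python
from typing import List


def project_peak(historical_peaks: List[float], headroom: float = 2.0) -> float:
    """Project a stress-test target from past event peaks.

    historical_peaks: peak requests per second for past events, oldest first.
    headroom: safety multiplier on top of the projection (an assumption).
    """
    if len(historical_peaks) < 2:
        raise ValueError("need at least two past events to estimate growth")
    # Growth ratio between each pair of consecutive events.
    ratios = [
        later / earlier
        for earlier, later in zip(historical_peaks, historical_peaks[1:])
    ]
    average_growth = sum(ratios) / len(ratios)
    projected = historical_peaks[-1] * average_growth
    return projected * headroom


# Made-up peaks for three past events, for illustration only.
target = project_peak([10_000, 20_000, 40_000], headroom=1.5)
```

Writing the calculation down like this also helps with the "what, how and why" conversation: anyone who doubts the target can see the inputs and challenge a specific assumption rather than the final number.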
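The difference between a stress test and a typical load test can be sketched in a few lines of Python: a load test checks behavior at an expected level, while a stress test keeps raising the load until something breaks. The example below runs against a simulated service rather than a real one; `find_breaking_point`, `simulated_service`, and the capacity figures are hypothetical.

```python
from typing import Callable, Optional


def find_breaking_point(
    error_rate_at: Callable[[int], float],
    start_rps: int,
    step_rps: int,
    max_rps: int,
    max_error_rate: float = 0.01,
) -> Optional[int]:
    """Ramp load in steps and return the first level whose error rate is too high.

    Returns None if the service survives every step up to max_rps.
    """
    rps = start_rps
    while rps <= max_rps:
        if error_rate_at(rps) > max_error_rate:
            return rps
        rps += step_rps
    return None


def simulated_service(capacity_rps: int) -> Callable[[int], float]:
    """Stand-in for a real system: requests above capacity fail."""
    def error_rate_at(rps: int) -> float:
        if rps <= capacity_rps:
            return 0.0
        return (rps - capacity_rps) / rps
    return error_rate_at


breaking_point = find_breaking_point(
    simulated_service(capacity_rps=2500),
    start_rps=500,
    step_rps=500,
    max_rps=10_000,
)
```

In a real exercise, `error_rate_at` would be replaced by whatever applies artificial load and measures the outcome, which is exactly the part that needs permission to break production.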
Alexandra Shaheen: For sure. There’s nothing like applying artificial load to a website while everyone is on a hangout together remotely and just hoping it all goes okay, but it did.
Quintessence Anx: It’s awesome. I’m glad it went okay because I know I had your map queued up just to see what’s going to happen.
Vinessa Wan: Everyone refreshes there.
Quintessence Anx: Yeah, right. Can you tell us a little bit about how you’re currently implementing hypercare?
Alexandra Shaheen: The New York Times has a sound incident management process that we rely on day-to-day. Our business is breaking news and this process does as well. However, for a major event like the general election, we need to do a bit more in terms of hypercare. We institute war rooms; this year, we did virtual war rooms. We engage our vendors, let them know what we’re expecting, and form the appropriate relationships as needed. We also look at communication forums and how we can communicate outwardly to a broad range of folks as needed for this event preparation. We send communications via Slack, we send emails, we say things. It’s important to grab everyone where you can reach them. And on the night of, we make sure we have our eyes on glass and fast collaboration should we need it. We don’t generally need hypercare in our day-to-day, but when you have traffic levels like this and an event that is of critical business and reputational importance, we absolutely need to institute hypercare.
Alexandra Shaheen: We also knew ahead of time that this level of hypercare was not sustainable if the race weren’t called on election day or the day after. Post November 4th, we adjusted our processes so that heightened support was still provided but with less pressure on those doing it. We didn’t have everyone sit on a hangout in perpetuity. Instead, we mandated that each system have not just one engineer on-call but two, so that if we saw an incident, two people could jump on it as quickly as possible. We also instituted deployment windows where we enabled code to be deployed, versus having freezes or other restrictions. These worked well for us.
Megan Araula: Yeah. Just to add to that, we’re not always in hypercare because it’s impossible to have a hundred engineers in a hangout for 24 hours or more. We only do it when there’s a big event. We can’t possibly engage in these practices long-term as they are costly and draining. In our day-to-day we rely on our regular incident management process, which uses PagerDuty when there’s an incident happening.
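The two post-election rules described here (deploy only inside agreed windows, and keep two engineers on-call per system) are easy to express as checks. The Python sketch below is one hypothetical way to do that; the window times, system names, and function names are assumptions, not The New York Times' actual configuration or tooling.

```python
from datetime import datetime, time
from typing import Dict, List, Tuple

# Hypothetical deployment windows (local time), for illustration only.
DEPLOY_WINDOWS: List[Tuple[time, time]] = [
    (time(10, 0), time(12, 0)),
    (time(14, 0), time(16, 0)),
]


def deploy_allowed(now: datetime, windows: List[Tuple[time, time]] = DEPLOY_WINDOWS) -> bool:
    """Return True if `now` falls inside any deployment window."""
    return any(start <= now.time() < end for start, end in windows)


def understaffed(on_call: Dict[str, List[str]], required: int = 2) -> List[str]:
    """Return systems with fewer than `required` distinct engineers on-call."""
    return sorted(
        system
        for system, engineers in on_call.items()
        if len(set(engineers)) < required
    )


# Hypothetical roster: "search" has only one engineer and would be flagged.
roster = {"cms": ["ana", "ben"], "search": ["chris"]}
gaps = understaffed(roster)
```

A check like `deploy_allowed` could gate a deploy pipeline, while `understaffed` could run against the on-call schedule before each day of extended coverage.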
Vinessa Wan: Yeah. I think in addition to all the event coverage and guidance, we also had a few different work streams to really make sure that we were ready in advance. We had a core team that was available. What Megan got to do with other engineers was work with each team to review where the architecture was and identify any risks or vulnerabilities. This team also led operational maturity assessments, where folks like Megan would work with a team and help them score themselves on how they were doing in certain operational practices. Did they have a healthy on-call setup, and things like that? Did they do over analysis? Because, again, we’re looking at a whole election season, that way the team could make sure they weren’t just prepared for the night of, but also for everything in between.
Vinessa Wan: And even with this election, we didn’t just have a night of, we had more of an election week. Alex and Megan talked a lot about having these regular stress tests. What’s important to call out is having a learning review, or a kind of blameless post-mortem, after each stress test to make sure the team is actually identifying what they need to improve and really going through the results. We also had some dedicated efforts around building out particular resiliency where we found that we really wanted to protect our most mission-critical workflows. And then, of course, we also had that coverage and all the guidance around deployment windows and what was needed to make sure we had all that immediate support so we could jump on things when they happened.
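A self-scored operational maturity assessment like the one described could be recorded as simply as the Python sketch below. The 0 to 3 scale, the practice names, and the passing threshold are all assumptions made for illustration; the panel does not describe the actual scoring rubric.

```python
from typing import Dict, List, Tuple


def assess(scores: Dict[str, int], passing: int = 2) -> Tuple[float, List[str]]:
    """Summarize a team's self-assessment.

    scores: practice name -> self-assessed score from 0 (absent) to 3 (mature).
    Returns the average score and the practices scoring below `passing`.
    """
    if not scores:
        raise ValueError("no practices scored")
    for practice, score in scores.items():
        if not 0 <= score <= 3:
            raise ValueError(f"score for {practice!r} must be between 0 and 3")
    gaps = sorted(practice for practice, score in scores.items() if score < passing)
    overall = sum(scores.values()) / len(scores)
    return overall, gaps


# Hypothetical self-assessment for one team.
overall, gaps = assess({
    "on_call_rotation": 3,
    "runbooks": 1,
    "post_incident_reviews": 2,
})
```

The list of gaps matters more than the average: it gives the team a short, specific set of practices to work on before the event.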
Quintessence Anx: Yeah. That sounds super resilient and that’s amazing. And since you all have kind of been iterating over this over time, what would you suggest to people who are just starting to research and implement? Because they realized, “Oh, I actually am in a scenario where we’re going to need elevated support.”
Vinessa Wan: I think leveraging this as a team building experience. Usually these types of events are what gets folks excited. It’s an opportunity to come together. And sometimes this is the work. I like to say that we are sometimes the stage crew, and sometimes you don’t get recognized for this type of work because your job is to make sure that people don’t notice you. So I think really getting people excited, and also making sure you’re building a team around this. Not just having one person, but also building resilience into your team structure.
Vinessa Wan: When I kicked off the 2020 effort, I was really excited to have someone like Megan join because Megan had a lot of solid engineering experience, but I was excited to have her really be a lead in this effort because she was going to give a fresh perspective, and that was balanced with some other folks on the team that had worked on previous elections. And then Alex was someone that also came in new with a lot of energy and a lot of new ideas. I think there’s an opportunity to brand it as an effort to build unity and get people excited. But then also looking at how you’re going to build that team out, and even just building resiliency into how you work on resiliency, if that makes sense. I feel like there’s a drinking game or a bingo.
Quintessence Anx: Yeah, I get you.
Alexandra Shaheen: I would say Vinessa’s right. You need to form a cross-functional group with folks of diverse expertise and knowledge bases. We had a group of engineers from systems all around the org who created and conducted these initial assessments that we used to gauge readiness. With so many systems in play at The New York Times, it’s critical to form a diverse team. And that really enabled us to be successful. We had folks that worked on DevOps, we had folks that worked in e-commerce, we had folks that worked in publishing, and from these diverse perspectives we began to form a uniform opinion of where things were at The Times.
Vinessa Wan: Yeah. And she also even had strong grounds for all project managers in the war too.
Alexandra Shaheen: Absolutely. It takes a village to get something like this done and you need in your immediate committee of those architecting your plans, a diverse and representative group that represents the village. That was really important.
Megan Araula: Yeah. Before this project, I hadn’t worked with Alex, so that was pretty good. It’s also a good way to meet new people in the organization. That’s great.
Vinessa Wan: Yeah. That’s definitely an exciting part. I think you’re right. It was cool to work with each other, we had a fun team. It’s a great opportunity especially for someone to get exposure in an organization or someone just to really be able to meet new people.
Quintessence Anx: Yeah. That sounds like it would be the case just by the cross-section of skills that you all are listing just now, that sounds really important. And as you’ve kind of been moving along with all these initiatives, are there any gotchas you found that people might not know that they can kind of get ahead of when they’re doing their first implementations?
Alexandra Shaheen: I would say scoping your work is number one. Those assessments and the tiering that we did were crucial. You are not going to be able to improve everything. You need to take a good look at what systems are most important to your company’s success, map your investments appropriately there, and guard scope with all that you’ve got, as your management layers are going to want you to fix anything and everything possible. With something like Election Readiness, we have a finite timeline. The election is going to take place on November 3rd, regardless of our readiness status. We need to structure our investment in the time that we have and make sure we’re putting our efforts on the stuff that is truly critical to our company’s success. It’s important to retain the scope at all times and message it out so that it is clear to everyone and you’re not being questioned on it.
Megan Araula: Yeah. Another gotcha is that just because a system hasn’t failed in a long time doesn’t mean it won’t fail during the event. This has happened to us, and we were lucky enough that it happened a few weeks or a few months before, so we were able to have a strategy around it. I think it’s really important not to be overconfident in a system that’s been reliable for two or three years, because you never know, for some reason it’ll go down. Just think about that, keep testing that system, or have a strategy in case it fails. It doesn’t have to be implemented, but have it written down and think about a solution if it were to happen.
Vinessa Wan: Yeah. I think also, it’s not just in terms of your systems failing: fatigue is real. This was a long election for us, and for those on-call and for those managing elections, you really need to make sure people know when to stand down and know when to take a break. And when you’re designing your processes and your event coverage guidance, be mindful of the fact that people are going to be fatigued. You want to make sure that you have enforced things like not having the same person cover a system for more than two days, and letting folks know that it matters, that you actually are able to take a break if you’ve worked on an incident for a certain amount of time, and that that’s fine.
Quintessence Anx: That makes a lot of sense. And thank you so much, all of you ladies, for the wonderful conversation and all of the information you’ve been providing us today. If you listeners of our podcast would like to learn more about Election Readiness at The New York Times, there is a Times Open blog post that is going live; you’ll find the link in the show notes. The New York Times is also hiring, so there has never been a better time to work for them. Check out their open roles, also linked in our description. And you can also support The New York Times by subscribing to the news. And with that, we do have two final questions that we ask everyone who’s been on the show. Are you all ready?
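A coverage rule like the one mentioned here (the same person should not cover a system for more than about two days in a row) can be checked automatically against a schedule. The Python sketch below is a hypothetical example; the function name, the two-day limit as a parameter, and the engineer names are assumptions.

```python
from typing import List, Optional, Tuple


def fatigue_violations(
    schedule: List[str], max_consecutive_days: int = 2
) -> List[Tuple[str, int]]:
    """Find where one engineer covers a system too many days in a row.

    schedule: one engineer name per consecutive day, for a single system.
    Returns (engineer, day_index) for the day each streak first exceeds the limit.
    """
    violations: List[Tuple[str, int]] = []
    streak = 0
    previous: Optional[str] = None
    for day, engineer in enumerate(schedule):
        streak = streak + 1 if engineer == previous else 1
        previous = engineer
        if streak == max_consecutive_days + 1:
            violations.append((engineer, day))
    return violations


# Hypothetical week of coverage for one system.
violations = fatigue_violations(["ana", "ana", "ana", "ben", "ben", "ana"])
```

Here the third consecutive day for the hypothetical engineer "ana" (day index 2) is flagged, which is the point where someone else should take over.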
Alexandra Shaheen: We are.
Megan Araula: Yeeha!
Vinessa Wan: Yeah, sure go for it.
Quintessence Anx: Okay. What is one thing you wish you would’ve known sooner about hypercare?
Alexandra Shaheen: Hypercare requires sound maintenance and communications. When you’re leading an org through hypercare, especially over an extended period of time, don’t assume that folks always know what to do inherently no matter how many times you’ve repeated it. You need to have a plan for everything and then have a plan for those plans just in case they go wrong and then continue to communicate those plans via every forum possible. Repetition is your friend here and you need to communicate transparently at all times when engaging in hypercare.
Quintessence Anx: Awesome. And is there anything you’re glad that we did not ask you about?
Megan Araula: Thank you for not asking what went wrong, because almost always something does go wrong. And it’s about how you manage the situation not about what failed. The circumstances are the bare bones of eventual retrospectives. Thank you for not asking.
Quintessence Anx: No problem. Thank you all again, Vinessa, Megan and Alexandra, for joining us today, and this is Quintessence wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @PageIt2TheLimit using the number two. That’s Page It to the Limit with the number two. Let us know what you think of the show. Thank you so much for joining us, and remember, uneventful days are beautiful days.
Megan is a lead software engineer at The New York Times. She enjoys solving a variety of business problems across multiple teams and missions while also advocating for system resilience and maturity. In her spare time she likes collecting plants and hitting up the slopes to snowboard.
Alexandra Shaheen started in non-profit administration, but dove right into the systems used to manage the grant-making process. She realized a love for the realm of engineering and seeing requirements result in tangible systems that make important work easier.
Alexandra joined The New York Times in 2018 as a program manager for the team responsible for building and rolling out a new article editor for The Times. After the article editor’s successful rollout in late 2019, Alexandra started as program lead on the election readiness project. She managed the assessments of critical systems, ran stress tests throughout the year, led resilience projects to fortify workflows with single points of failure, and created event preparation requirements for all of technology. She considers this to have been her dream project.
Vinessa is a technical product manager at NYT. She loves to apply design principles to developer tooling and resilience engineering concepts to her daily life.
You can check out her essays in the upcoming O’Reilly book, 97 Things Every SRE Should Know (https://97things.incidentlabs.io/). When she’s away from her keyboard, you can find her building lego castles with her daughter.
Quintessence is a Developer/DevOps Advocate at PagerDuty, where she brings over a decade of experience breaking and fixing things in IT. At PagerDuty, she uses her cloud engineering background to focus on the cultural transformation, tooling, and best practices for DevOps. Outside of work, she mentors underrepresented groups to help them start sustainable careers in technology. She also has a cat and an aquarium with two maroon clown fish and a mantis shrimp, of The Oatmeal fame.