SRE Journey at Adidas With Andreia Otto

Posted on Tuesday, Mar 19, 2024
Successful Site Reliability Engineering (SRE) teams are skilled in both software and systems engineering, allowing them to manage reliable, scalable systems. They proactively identify and address potential issues, use failures as learning opportunities, and automate processes to reduce toil. They also prioritize communication and collaboration with other teams to ensure service reliability and performance. Join us as we discuss the journey of SRE teams at Adidas.

Transcript

Tiago Barbosa: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both systems reliability and the lives of people supporting those systems. I’m your host Tiago Barbosa, and you can find me on Twitter at t1agob. Hello Andreia! Welcome to Page It To The Limit. This is episode, I think, around 110. So welcome to the show. Thank you so much for accepting the challenge. For people that don’t know you yet, can you tell us a bit more about your role and a bit of your experience? That would be great.

Andreia Otto: Sure thing. Hello Tiago! Thanks for having me here. Thanks for the invitation. I’m Andreia. I work at Adidas, leading our SRE (Site Reliability Engineering) and Operations team. I take care of the whole team that is responsible for the e-commerce platform. We have team members all around the globe to make sure that we cover our e-commerce platform 24x7. A bit about me: I started with ALM, application lifecycle management as we called it then, when I worked with Tiago back in the day. Things started moving and evolving, and today I’m leading the SRE team, which I’m quite happy with.

Tiago Barbosa: Cool. That’s a very good evolution from, as you said, we worked in the past together and yeah, it’s really nice to see your career changes and the evolution that you made, so that’s great to see. Just out of curiosity, how many people does the e-commerce team have? The SRE team that basically handles e-commerce, do you have an idea?

Andreia Otto: Yeah, when you asked how many people, I thought: how many people in the whole team? That would be a big number, but in the SRE team we have 66 internals. So in my direct team, this is the model view and we have it covered from different locations. We have tech hubs in India; in Europe, in Zaragoza, Amsterdam, and Herzogenaurach here in Germany, where I’m based; and we also have people in Bogota. So we kind of follow the sun to ensure a smooth transition.

Tiago Barbosa: Yeah. Cool, cool, cool. One of the things that I would like to ask you, and this is a big question, I know, but one of the things that I’ve seen while talking to different customers in different companies is that for many customers the SRE teams are kind of established already. They have their processes, they have their way of working and of providing capabilities to other teams in many cases. Others are more in the early stages. They are just starting to figure it out; they are starting to build these teams and these capabilities. Can you share a bit about the journey that you are currently going through?

Andreia Otto: We started the SRE journey, I mean the DevOps journey, when I started seven years back. We were moving through a DevOps transformation and there was a clear need to get closer to operations. At the beginning we had a clear separation between what is development and what is operations, and at that moment in time we said: we need to be closer, we need to be faster, we need to collaborate better. We had some releases that were taking like a day, and they happened every six weeks, so it was not really manageable. So we started the SRE journey three or four years ago. We started in the core of our systems, where we have the whole checkout and payments processes, which for us are of course among the major contributors and the most critical components that we have. We started with the embedded mode, as we call it: the SREs, and the ‘E’ for engineers is very strong, being part of the product teams. So we have developers, we have QAs, we have SREs. The SREs focus on observability and resilience, and make sure that before we go live, or before any change is released, we already have this flavor of the stability mindset. And then we grew, and the embedded mode became the default mode. We really needed to bring stability to all the areas, so each product team had some SREs assigned to it. What we are moving towards now is to have a foundations team. So instead of only embedded, we need to have embedded where we need it, but we also need to have foundations, as you said before, right? Having SREs enabling the teams. Some teams don’t need SREs, some teams just need SREs to enable them, and then we can move back and focus on where we need to focus. So our evolution was: embedded, bringing stability, and now we are moving towards not having embedded everywhere, but only where we need it. And also let’s have a foundations team to make sure that we have standards as much as possible, so that we can replicate what is working as much as possible.

Tiago Barbosa: I think part of the job of an SRE like you is to identify the places where things can be optimized because well, your customers are typically internal customers, the different teams that you are supporting and there’s always space for improvement independent of tools and processes that you are using. So that’s good to see.

Andreia Otto: Our customers are internal, but we also work directly with the end consumer. So if you go to our Adidas website and something goes wrong, there is probably an incident that will be in my queue. This direct-to-consumer side is super important, because we get direct feedback on what’s going on.

Tiago Barbosa: And this is really impactful. If something goes wrong, if something goes down for a brand or a company like Adidas, for an e-commerce platform it’s not only the money that you are losing, but also the reputational damage that the brand faces. Right. Yeah. Cool. So one thing that I am curious about, and I don’t know if you can share a lot about this, is: what were the challenges you faced that are pushing you to move away from the current model of SREs being embedded in teams to the next evolution of that? What were the main challenges or limitations you identified that are pushing you in this direction? Is it because you are changing the way that you build software, or is it some integration? What led to this?

Andreia Otto: Yeah, so there are a couple of reasons. When we started, we wanted to make sure that every area had the right attention, and we reached a point where some areas are super, super stable, while in other areas we are moving through a transformation from monolithic to microservices, where many, many changes are happening. That is where we need, from the beginning, to have the right people focusing on stability. So the whole purpose of slightly changing the operating model is to make sure we have the flexibility to move capacity and focus to where we need them the most. To give some examples: when we talk about monoliths, there are a lot of cons, but the pro is that it’s one big thing. You make a request, you get a response from that big thing. When it comes to microservices, if you see it from the SRE angle, something that was done in one request now takes 12 requests with different responses, and then it can be super challenging. This is one of the main challenges we currently have in this whole change of architecture: small things that previously we were not really looking for or checking in detail. Are all the status codes standardized? To cover this flow, did we review whether we have the logging standards all in place to make sure that every application has all the information we need? We also need to think about tracing, right? Because with multiple microservices, do we have the right tooling to provide this tracing? Do we need something else? We have a platform engineering team and we are working closely with them. How can we have capabilities that we didn’t have before because we didn’t need them before? The more we move to microservices, the more we need them. In terms of stability, and in terms of observability especially, there is a big complexity that comes with microservices that we cannot deny. So currently this is our main challenge, but we are moving, and that’s the whole purpose of having flexibility: to bring the best to where we need it the most and make sure that we start on the right foot.
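
The logging and tracing concern above can be illustrated with a minimal sketch. It assumes a hypothetical logging standard (the field names in `REQUIRED_FIELDS` are invented for illustration) and a shared `trace_id` that each microservice copies into its log records. None of this reflects Adidas's actual tooling; it only shows why one missing field makes a multi-service request hard to follow.

```python
from collections import defaultdict

# Hypothetical logging standard: every service must emit these fields.
REQUIRED_FIELDS = {"service", "trace_id", "status_code", "message"}


def missing_fields(record: dict) -> set:
    """Return the standard fields that a log record does not carry."""
    return REQUIRED_FIELDS - set(record)


def group_by_trace(records: list) -> dict:
    """Group compliant records by trace_id, keeping the order of services.

    Records that violate the standard are skipped, so a service that drops
    the trace_id disappears from the reconstructed request flow.
    """
    flows = defaultdict(list)
    for record in records:
        if not missing_fields(record):
            flows[record["trace_id"]].append(record["service"])
    return dict(flows)


# Illustrative records only; service names and messages are made up.
records = [
    {"service": "basket", "trace_id": "t-1", "status_code": 200, "message": "item added"},
    {"service": "payment", "trace_id": "t-1", "status_code": 502, "message": "provider error"},
    {"service": "stock", "status_code": 200, "message": "reserved"},  # no trace_id
]

print(missing_fields(records[2]))  # {'trace_id'}
print(group_by_trace(records))     # {'t-1': ['basket', 'payment']}
```

In this sketch the `stock` service never shows up in the flow for `t-1`, which is the kind of gap a shared standard is meant to prevent.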

Tiago Barbosa: That’s critical of course. And one thing that you mentioned is that you have a platform engineering team. I do believe having the SRE team, basically because you and your team, you are the ones that basically need to make sure that reliability is there, you know the standards that you would like to see implemented. And typically platform engineering teams are the ones that create typically a bunch of assets. It might be like as you said, tracing SDK or anything from observability perspective that will allow developers to easily implement best practices. So you as a company can standardize on the HTTP status that you mentioned, but many other things that will allow you to easily collect information that you can work from. And for developers, of course you have basically the same way of working across all microservices. So this is how you typically work, right?

Andreia Otto: Yeah, absolutely. So the SRE is the bridge between platform and development. We are kind of in the middle, not only for observability, but for all of CI/CD, for example. So all the pipelines, all the infra that we deploy: the more we collaborate with platform, the better and easier our job is with the application. We can really focus on the application side, and everything that is underneath we have covered, which is super handy. We have multiple teams providing all the platforms that we can plug in and use. That’s amazing.

Tiago Barbosa: Yeah, definitely. So Andreia, one thing that I would like to ask you is based on the dimension of a company like Adidas and the number of customers that I assume you have. We are talking about a platform that runs at a very large scale. One of the events that I can think of that might be one of your top events, or one of the critical times for your team specifically, is something like Cyber Week, right? Is this your main event? Is this one of your main events? What do you do to prepare for a Cyber Week?

Andreia Otto: Yeah, definitely. I think for the whole e-commerce industry Cyber Week is one of the biggest events. Each company has its own waves of events and whatnot, but Cyber Week is definitely one of the biggest ones. We used to have something like holiday readiness, but we are moving towards being always ready. Because we never know: if you look at last year, the industry pulled forward some of the sales. Not everyone went on Black Friday or Cyber Monday; they ran a lot of campaigns before. And the more we see this behavior, the more ready we need to be. So we are moving towards being always ready. How can we leverage HPA, horizontal pod autoscaling, how can we leverage the whole autoscaling capabilities that we have to be ready? But indeed, Cyber Week is a big deal, and we need to make sure that we are running a lot of load testing and a lot of collaboration, because any company of our dimension has multiple teams behind one customer. For you to go and add an item to your basket and purchase something, there are multiple teams behind the scenes. Making sure that the whole collaboration works, making sure that we align with the markets, and making sure that everything is ready is a big deal. Having the platform ready and working on autoscaling is something that we do target.
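
For readers unfamiliar with HPA, the core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is roughly: desired replicas = ceil(current replicas × current metric / target metric), clamped to configured bounds. The sketch below is a simplified illustration of that rule only. It omits the tolerance band and stabilization windows that the real controller applies, and the numbers are invented, not Adidas's configuration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA-style calculation.

    Scales the replica count by the ratio of observed to target metric,
    rounds up, then clamps to [min_replicas, max_replicas].
    """
    ratio = current_metric / target_metric
    raw = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, raw))


# Load rises: 4 pods at 90% average CPU against a 60% target.
print(desired_replicas(4, 90, 60))                    # 6
# Load drops: 4 pods at 30% against a 60% target, floor of 2 pods.
print(desired_replicas(4, 30, 60, min_replicas=2))    # 2
# A traffic spike that would ask for 40 pods is capped at the maximum.
print(desired_replicas(10, 200, 50, max_replicas=20)) # 20
```

The last case hints at why load testing still matters alongside autoscaling: the upper bound has to be chosen and validated before the peak arrives.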

Tiago Barbosa: Cool. And one thing that also on that topic, because you are kind of building this e-commerce platform, and this might be my personal perspective or maybe not, where typically e-commerce platforms have a lot of integrations, some of them built internally, others built externally, and the way that you handle reliability internally and externally might be slightly different. Is that the case for Adidas? And if so, what are you doing to prepare for that as well?

Andreia Otto: No, absolutely, and I think this is one of our main focuses. We obviously have third parties, and we obviously have those hard ones where, if they fail, we have bigger problems. There are others where, if they fail, we can fall back, and that is no big deal. But a big focus for my SRE team today is how we can be resilient to our third parties, especially those that break our main customer flows: mapping and understanding what is blocking. Can we have a fallback? Can we have a degraded service? Can we have a circuit breaker? Can we do it automatically? Shall we do it via a feature flag, where someone needs to click a button? That’s a big topic that’s going on. And linking to the previous question on Cyber Week: during Cyber Week it is not only Adidas, it is worldwide, all the companies, and if you look at the providers, they’re also under a lot of stress. So we need to make sure that we are ready, because something might fail. That’s rule number one: something might fail. The thing is how we can handle those failures and which of them will break our flow. As long as we can handle them as much as possible, we are in good shape. So, long story short, the focus for my team is to ensure that we know the critical flows, we know the critical dependencies, and we have a mitigation plan in place, be it something manual or something automated. How can we be resilient in our ecosystem?
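
The fallback and circuit-breaker questions above can be sketched in a few lines. This is a generic, minimal circuit breaker, not Adidas's implementation: the class name, thresholds, and the injectable `clock` parameter are all assumptions made for illustration. After a number of consecutive failures the breaker "opens" and serves the fallback without calling the third party; once a cool-down has passed it lets one trial call through.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker with a fallback (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # time the breaker opened, or None if closed

    def is_open(self) -> bool:
        """True while calls to the dependency should be skipped."""
        if self.opened_at is None:
            return False
        # After the cool-down, allow a trial call (the "half-open" state).
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary, fallback):
        """Run primary(); on failure or while open, return fallback()."""
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A usage example might wrap a hypothetical third-party call as `breaker.call(fetch_live_rates, use_cached_rates)`, where both function names are placeholders. The manual alternative mentioned in the conversation, a feature flag that someone flips, would amount to forcing the fallback path regardless of the failure count.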

Tiago Barbosa: Yeah, you bring a very, very good point, which is the fact that, well, you are probably running in some cloud provider somewhere, but your third party providers are most certainly also running in a cloud provider somewhere. And we know that cloud providers also fail. There’s outages, there’s all of these things that we need to account for. So it’s not only the software that we are building, but all the infrastructure. That’s a very good point. And that leads me to one of my last questions, which is actually when something goes wrong, there’s an incident. And basically how do you handle that? Is it a separate team that manages this, does it go directly to the SRE team? And what is typically the process that you have? If you can share a little bit?

Andreia Otto: One thing that I’m linking to the very beginning, when I explained that our systems are direct to consumer: we take care of a platform that is available to end users globally. It’s always midnight somewhere and there is always someone trying to buy something. So we have our prioritization matrix, with our levels of incidents, P1, P2, P3, and it is defined very well what qualifies as a P1, P2, or P3 in our critical flows. We have a horizontal team of service managers that I’m very proud of. This team puts all the processes in place, and one of them is incident and problem management. Being direct to the customer, we need to react very fast. We have a ticketing system where tickets can be created by users or by alerts. And whenever an alert or a P1 is created, someone immediately receives the callout: this was created, you need to have a look and evaluate whether it indeed qualifies as a P1 or not. We have a process defined where we can create a bridge call. In this bridge call we have an incident commander who makes sure that we have the right people in the call, and also that whoever is not needed can leave the call. We don’t want to have too many people in the call. This incident commander is the one bringing people in and out to make sure that we mitigate the problem. The same person who plays this commander role also sends updates via emails and chats. Whenever it’s mitigated, we close the call, of course, and afterwards we have a postmortem to understand the root cause, what happened, and the action plan, which is the most important. Things will go wrong. When things go wrong, we need to ensure that we know what happened and that we take actions, and usually those actions will be backlog items for the product teams. The SRE teams will obviously be a big part of it, and then we close the loop. So if something happens, it shouldn’t be the same thing that happened before, and the same process follows.
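
The triage step described above (a ticket arrives from a user or an alert, gets a priority, and a P1 triggers an immediate callout) can be sketched as follows. The actual criteria Adidas uses for P1, P2, and P3 are not stated in the conversation, so the rules below are purely hypothetical, as are the `Ticket` fields and the choice of checkout and payments as the critical flows (taken from their earlier mention as core systems).

```python
from dataclasses import dataclass

# Assumption: treat the flows named earlier as critical for this sketch.
CRITICAL_FLOWS = {"checkout", "payments"}


@dataclass
class Ticket:
    flow: str            # which customer flow is affected
    fully_blocked: bool  # True if customers cannot complete the flow at all
    source: str          # "user" or "alert"


def classify(ticket: Ticket) -> str:
    """Assign a priority using hypothetical rules (not Adidas's real matrix)."""
    if ticket.flow in CRITICAL_FLOWS and ticket.fully_blocked:
        return "P1"
    if ticket.flow in CRITICAL_FLOWS:
        return "P2"
    return "P3"


def needs_callout(ticket: Ticket) -> bool:
    """In this sketch only a P1 pages someone immediately."""
    return classify(ticket) == "P1"


print(classify(Ticket("checkout", True, "alert")))      # P1
print(classify(Ticket("payments", False, "user")))      # P2
print(classify(Ticket("wishlist", True, "user")))       # P3
print(needs_callout(Ticket("checkout", True, "alert"))) # True
```

The person who receives the callout would still review the classification by hand, as described in the conversation; the code only models the automatic first pass.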

Tiago Barbosa: Yeah, exactly. No, that’s really good to see. So one of the things that I see in my conversations with some customers that are PagerDuty customers is that they handle the whole incident response side of things. They have automation to help them figure out the problems quickly and all that. But then the last step that you mentioned, running the postmortem, learning from the problem, and improving based on that, is something that I don’t see implemented everywhere. And I know that sometimes it is a process that takes some time and requires a lot of people to join a meeting to discuss the problem and the possible solutions. But in the end, I do agree that this is probably the most important part of incident management and incident response, which is: okay, let’s make sure that this doesn’t happen again, like you mentioned.

Andreia Otto: Absolutely. I think that the whole incident and problem management only closes when we have the root cause analysis, all the action items are done, and we actually close the problem. So if it happens again, it’s too bad. Right?

Tiago Barbosa: Yeah, definitely.

Andreia Otto: We need to learn from all the big failures.

Tiago Barbosa: Yeah. One thing that I wanted to ask you still on this topic is you mentioned that the service management team, they have a really important role in this incident management process. Are they part of your team as well or are they a separate team and they have people with expertise or specialized in different parts of your e-commerce platform or they’re more generic, let’s say?

Andreia Otto: No, they are part of my team and that’s where we could really focus and bring the maturity that we have today. So they’re part of the team, they attend our leadership calls and I mean they work closer to all the SRE leads because afterwards on postmortems, we have action items. Those action items are things that we also track. So it’s something that service management team also helps to keep pushing, let’s make sure that this is done. But in a nutshell, they’re part of the team and they play a major role.

Tiago Barbosa: Yeah, it’s important to have someone whose responsibility is to make sure that we follow the processes end-to-end, and that all the feedback we collect goes back into the backlog so we can always improve. That’s great to see. So Andreia, this is all I had for questions on understanding how SRE works at Adidas, a global company with a lot of exposure to customers.

Andreia Otto: Thanks for the time. Thanks for the questions. And I think SRE is always maturing. We started with DevOps, and now we have SRE implementing those practices, right? We are always evolving, we are always learning, and I’m really confident that moving towards having foundational team members will bring us to the next level. Let’s put it this way. And I hope that everybody who is watching also sees that Adidas is super stable.

Tiago Barbosa: Yeah, I do agree. I’m a customer myself and I love the brand, and my experience with the platform has been very good so far. So Andreia, once again, thank you so much for joining us today. All the best in your job at Adidas, and I hope you can smoothly move to the next step, as you said. All the best. And once again, thank you so much for joining us and sharing a bit of your knowledge and experience with us. That does it for another installment of Page It To The Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter at pageit2thelimit. That’s @pageit2thelimit. Let us know what you think of the show. And thank you so much for joining us. And remember, uneventful days are beautiful days.

Show Notes

Guests

Andreia Otto (she/her)

Andreia Otto leads the adidas Digital SRE and Operations team, driving SRE principles and DevOps practices to strengthen the resilience and reliability of the eCommerce platform. With a background in software engineering, Andreia has steered her professional journey into the domains of application lifecycle management and platform engineering. In her free time, you will probably find Andreia with her dogs hiking around Germany!

Hosts

Tiago Barbosa (he/him)

Tiago Barbosa is a Developer Advocate for PagerDuty. With 13 years of experience in the tech industry, he has helped hundreds of companies of various sizes and industries on their journey to build resilient and scalable cloud applications while working for Microsoft and AWS. Before moving to PagerDuty, Tiago ran the Cloud and Platform Engineering teams for Music Tribe. When he is not busy working or travelling, he is most certainly spending some good time with his family, playing music, or surfing.