Monitoring to Observability With Satbir Singh

Posted on Tuesday, Sep 2, 2025
The transition from traditional monitoring technologies to modern observability tools can leave teams confused. Satbir Singh joins us to talk about the new goals of observability tooling and how it will help teams conquer challenges in complex systems.

Transcript

Mandi Walls (00:09): Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve system reliability and the lives of the people supporting those systems. I'm your host, Mandi Walls. Find me at LNXCHK on Twitter. Welcome back, folks, to Page It to the Limit. With me today I have Satbir Singh, one of my fellow ambassadors at the DevOps Institute under its new name, PeopleCert, and we're going to be talking about monitoring and observability. It's been a while since we've talked about observability on the show, so it's time for a refresh, especially now that everything might have AI in it, so we have a new perspective. So welcome to the show. Tell us a bit about yourself and what you do.

Satbir Singh (01:01): Thanks for having me, Mandi. It's great to be here. I'm Satbir Singh. I work at Cisco, where I focus on observability, AIOps, and automation, particularly around AppDynamics and integrations. I have been in the software industry for close to 18 years, with a strong focus on customer engineering and observability.

Mandi Walls (01:28): Excellent.

Satbir Singh (01:31): Beyond my work at Cisco, I'm also a senior IEEE member, and I mentor students at some US universities on their capstone projects, to help them with the job hunt after college or prepare them for new technology challenges.

Mandi Walls (01:53): Excellent. That sounds like that keeps you pretty busy, I’m sure.

Satbir Singh (01:57): Yeah.

Mandi Walls (01:58): Awesome. Alright, so for folks who would like a bit of a refresher or who are new to the space: a lot of folks are probably familiar with, say, monitoring, but what's the difference between monitoring as we think of it traditionally and observability?

Satbir Singh (02:14): I would say monitoring is looking at your car dashboard, which tells you whether the fuel is low or the engine is overheating, but observability is more like being able to pop the hood and see exactly what the problem in the engine is. So monitoring basically tells you if something is wrong, but observability helps you figure out why it's wrong and what's happening; it basically gets you to the root cause of a problem.

Mandi Walls (02:47): Excellent. So I mean obviously as we get more and more complex systems, we need both of those things. Why do you think observability has become such a hot topic in the past couple of years?

Satbir Singh (03:02): In the last decade or so, software has become increasingly complex, with microservices architectures, containers, and cloud environments, and things are constantly changing. Old monitoring systems were not designed to handle these kinds of applications, and that's why modern observability helps us dig deeper into these complex applications, find the root cause quickly, and basically resolve all those production issues.

Mandi Walls (03:42): We want to know what's going on when someone makes a request: it's going to ping-pong through all these other systems and all the things it could potentially hit along the way, and where that request goes. You can't figure that out from just a thing that pokes and asks, is it up? Is it up? Is it up? Is it giving me 200s? Is it there?

Mandi Walls (04:00): We need all this additional stuff. As we move between these two things and the practice evolves, what are some of the challenges you see when you work with organizations? What do they face when they're moving from traditional monitoring and maybe augmenting that to also include observability in their space?

Satbir Singh (04:20): I think the hardest part is more about the culture.

Mandi Walls (04:25): Okay.

Satbir Singh (04:26): Yeah. Teams have to shift from just reacting to alerts to actually exploring data and asking questions. Traditionally, companies have too many tools holding logs and event data, and they are not talking to each other; they are in silos. Observability helps correlate across all these tools and gives us the broader context of where exactly the problem lies. And besides that, observability can be seen as a cost rather than as an investment in resilience. Changing that culture, I think, is one of the challenges that I see in this space.

Mandi Walls (05:20): For the application developers who are sort of putting this stuff together: I feel like in the past, monitoring just kind of happened once things got into production. Observability, I think, requires a lot more conscious decision-making in the application itself, and it's part of the design and development process. You can't just bolt it on at the end. It has to be intentionally put there. Do you find folks struggle with that? It feels like, as you say, it's an investment that has to be made upfront.

Satbir Singh (05:54): The struggle is getting used to the new tool and finding value in it, because it takes time to set up observability in your environment. Then there might be some issues initially with the agents; it sometimes happens that if they are not configured properly, the applications might go down, and that can cause initial fear in the teams. So those are the challenges that can come up while setting up observability in an environment.

Mandi Walls (06:25): And once you get it there, how does observability help or change the way you see folks do things like incident response for example? Does it help them? Is it just something that they have to learn to do that’s different? How do you see that play out?

Satbir Singh (06:44): It actually makes their lives easier. With observability tools, getting to the root cause of an incident is quicker, because we are already correlating our metrics, events, and traces, so we know exactly where the problem lies and which are the breaking points in the application. Developers can be more focused on those pieces and make them more resilient. That's a feature I see in PagerDuty as well, where similar incidents around services can be grouped together so we don't have alert fatigue, and duplicate events can be merged. So when we are firefighting or there's an incident going on, we don't have to look into a bunch of separate issues. I think it definitely makes life easier, but initially there might be some hesitation, as with every change; there might be some hesitation about bringing new technology into the system.
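To make the grouping and deduplication idea concrete, here is a toy sketch in Python. It is not PagerDuty's actual algorithm (which uses machine learning behind the scenes); the Event shape, the time window, and the grouping key are all hypothetical choices, shown only to illustrate how duplicate events can be merged and related ones bucketed by service.

```python
# Toy illustration of event deduplication and grouping, NOT a real product algorithm.
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    service: str      # which service emitted the event
    summary: str      # short description, e.g. "HTTP 500 rate above threshold"
    timestamp: float  # epoch seconds


def group_events(events: list[Event], window_seconds: float = 300.0) -> dict[str, list[Event]]:
    """Merge exact duplicates within a time window, then bucket the rest by service."""
    seen: set[tuple[str, str, int]] = set()
    incidents: dict[str, list[Event]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        # Duplicate suppression: the same summary from the same service
        # inside the same window collapses into a single entry.
        bucket = int(event.timestamp // window_seconds)
        key = (event.service, event.summary, bucket)
        if key in seen:
            continue
        seen.add(key)
        incidents[event.service].append(event)
    return incidents
```

The point of the sketch is that responders end up looking at a handful of per-service buckets instead of every raw alert.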

Mandi Walls (07:51): Absolutely. Some folks are big on learning whatever comes across their desk, and other folks are comfortable where they are and they're like, "I don't know if I want to learn something new right now." Sure.

Mandi Walls (08:05): You mentioned what we call AIOps, the ability for the machine learning behind a platform like PagerDuty to group alerts together into the same incident. We call that AIOps; I think the term was originally coined by Gartner for that particular role. When we deploy that alongside observability, how is it an enhancement? It feels like sometimes it can squash a lot of things, maybe suppressing information that we might want to know; maybe there's a little more subtlety that way. How do you see folks deploying these kinds of things together in their environment?

Satbir Singh (08:45): AIOps is actually the perfect partner for observability. When it's done well, it cuts down a lot of noise by grouping related alerts and actually surfacing the real issues faster. I think the challenge is trust. Engineers need to trust the AI, because AI can sometimes hallucinate, and engineers don't always understand why it is doing what it is doing. So if it is transparent and it reduces the noise, it's definitely a great combo.

Mandi Walls (09:22): Yeah, excellent. And how does that help folks? Especially in complex environments, you have a lot of firefighting, a lot of reactive tasks, and things that come across as incidents. How do these expanded practices help move folks? Do they help them move to more proactive engineering, to think about resilience upfront? How do you see folks improving the overall reliability of their systems with these kinds of tools?

Satbir Singh (09:55): Yeah. Observability gives engineers a chance to catch the warning signs early. We have metrics like response time, error rate, and calls per minute. If we see an anomaly in these metrics relative to the baseline, it generally signifies that there is some problem in the system, and fixing these problems before customers even notice definitely builds a culture of resilience instead of constant firefighting. So it definitely helps make teams more proactive and resilient.
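Here is a minimal sketch of that kind of baseline check, assuming we already have a window of historical samples for a metric such as response time in milliseconds. The three-standard-deviation threshold is an illustrative choice, not something prescribed in the episode.

```python
# Simple baseline/anomaly check over a single metric series.
from statistics import mean, stdev


def is_anomalous(baseline: list[float], latest: float, max_sigmas: float = 3.0) -> bool:
    """Flag the latest sample if it deviates from the baseline by more than max_sigmas."""
    if len(baseline) < 2:
        return False  # not enough history to establish a baseline
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > max_sigmas


# Example: the last few response-time samples vs. the newest one.
history = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]
print(is_anomalous(history, 410.0))  # True: worth investigating before customers notice
```

Observability platforms do this kind of baselining per metric and per service automatically, typically with far more sophisticated handling of seasonality and trends than this sketch.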

Mandi Walls (10:40): When do folks have to start thinking about this? Is it in the early design phase of an application, when they're thinking about the target environment that they want to have? We call them operational features or operational requirements: we want this to be awesome in production. How soon should we be thinking about these things as we're creating new applications?

Satbir Singh (11:02): When we are designing applications, we have standards around OpenTelemetry. We should make sure that our applications support OpenTelemetry, because once an application supports an open standard, it can send metrics to any product, and then we are not locked in with any specific product for our monitoring requirements. So I think if we take care of this during development itself, it's a lot of benefit to the team in the long run.
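As a rough illustration of what that looks like in code, here is a minimal OpenTelemetry tracing sketch in Python (using the opentelemetry-sdk package). The service and span names are hypothetical, and the console exporter is just for demonstration; in a real setup you would swap in an OTLP or vendor exporter, which is exactly the portability being described.

```python
# Minimal OpenTelemetry tracing setup; requires the opentelemetry-sdk package.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure once at startup: the exporter decides where telemetry goes.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name


def handle_checkout(order_id: str) -> None:
    # Each request becomes a span; swapping the exporter changes the backend
    # without touching this business logic.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...
```

Because the application only talks to the OpenTelemetry API, changing where the telemetry is sent becomes a configuration change rather than a code change.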

Mandi Walls (11:46): For sure. Making those investments early, before things get to production, is always easier than trying to fix it later once it's already out there anyway.

Satbir Singh (11:55): We have observability tools which can monitor applications even if they were not developed with the OpenTelemetry standards in mind. But definitely, as we make progress in the observability space, the OpenTelemetry standards are developing and will be much more mature in the future. And if applications already support them, that definitely helps with the resilience of an application and its observability needs in the future.

Mandi Walls (12:28): Absolutely. That's awesome. Yeah, there's been a lot of work in the OTel space in the past couple of years; it's been really interesting to watch that finally emerge. It feels like it's almost late to the party: we've been in these complex environments for so long, with all these different tools, and now finally it's, okay, let's align on something, because things have been crazy for sure. So if companies say they're doing observability, what indicates that they're doing it right? Sometimes we see folks where it's really just monitoring with fancy dashboards rather than actual true observability. What kinds of things are good indicators that folks have made that shift from traditional monitoring to the new way, to observability?

Satbir Singh (13:22): Yeah. I think one sign is when engineers can ask new questions or resolve issues without creating extra dashboards or doing new customization to find the root cause of whatever problem is ongoing. If engineers can do that, then I think that's great progress in observability. And if there used to be different tools and now we can correlate all those metrics, events, and traces and find the problem, instead of going into separate tools and hunting for it, I think that also shows we have made good progress in observability.

Mandi Walls (14:09): Yeah, excellent. You mentioned two words there, events and traces. Can we dig into those a little bit for folks who may not have heard them before? What are events and what are traces in the context of observability for applications?

Satbir Singh (14:22): Events are basically what you get whenever we set up alerts in our observability tool: it sends out some kind of signal saying that there's a problem. That's what we call events. The terminology can change, and some products might say it differently, but an event is basically the signal the observability system generates when there's a problem, based on whatever conditions we have set up in the system. So if we say, when calls per minute go above a hundred I want an alert, then that is kind of an event. And a trace is a map of various services and how they're interacting with each other: which service is calling which other service, and what types of calls are going between those services. So that is a trace.
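To tie the event idea to something concrete, here is a minimal sketch of the "calls per minute above a hundred" rule. The ThresholdRule class and the event payload shape are hypothetical; real observability tools let you express rules like this in their own alerting configuration rather than in application code.

```python
# Hypothetical threshold rule that turns a metric sample into an event payload.
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    metric_name: str
    threshold: float

    def evaluate(self, value: float) -> Optional[dict]:
        """Return an event payload when the condition is met, otherwise None."""
        if value > self.threshold:
            return {
                "metric": self.metric_name,
                "value": value,
                "threshold": self.threshold,
                "timestamp": time.time(),
                "message": f"{self.metric_name} is {value}, above {self.threshold}",
            }
        return None


rule = ThresholdRule(metric_name="calls_per_minute", threshold=100)
event = rule.evaluate(142)  # a dict, i.e. the signal the observability system would send out
```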

Mandi Walls (15:16): That’s your map through the complexity.

Satbir Singh (15:19): Yes.

Mandi Walls (15:20): Very cool. So this stuff has evolved so quickly over the past couple of years. Where do you see it going in the next, say, three to five years? Is there stuff on the horizon that looks like a sure thing we're going to see in this space in the coming years?

Satbir Singh (15:39): Yeah, I think we'll definitely see more automation as we get more advanced AI models, but I think humans won't disappear from this. Humans will definitely be guiding observability as a field, but automation and AI will power a lot of the backend. Simple issues will self-heal, but engineers will still need to step in for complex and unusual problems, because in the real world there will still be systems where AI cannot log in to a server and install an application because it needs some special permission. So humans will still be needed, but I think AI and automation will power a lot of the backend. That's my take.

Mandi Walls (16:37): Yeah, it's fascinating. I remember the first time we got mired in this self-healing idea; it was maybe 2006, it's been some time since that was the thing that was going to happen. I think we're finally going to get there for some of our well-understood stuff, especially with some of these newer models that are coming out. I'm not sure how much folks are going to trust all of them right off the bat. If my LinkedIn is any indication, folks are definitely in two camps on some of this stuff: who's going to trust what, and what's going to come forward. But for teams that are overburdened and have too much going on already, having those extra AI teammates to help them out, especially with the easy stuff during incident response, would be super helpful for a lot of these folks. There's a lot going on out there for sure. And speaking of that, how does observability impact folks who are on call? You have incidents; they happen at any time, they happen all the time. Some of them can be super easy, you just need to restart something; other stuff may need some deep debugging; still other things may involve a third party somewhere. There's just a lot of stuff going on.

(18:02): But how is observability helping these folks out in terms of dealing with their on-call burden, if you want to call it that way?

Satbir Singh (18:14): As we were discussing earlier, with a lot of AI features now, observability tools can group a lot of incidents together if they're from a similar service, and events that are duplicates can be merged so that there's less alert fatigue. In a firefighting situation, when there's a critical issue going on, any help is welcome, and when we have fewer alerts, that's definitely great. And now we have root cause analysis features, where the observability tool can help us understand what went wrong, how it can be taken care of in the future, and how to make the service more stable. So this is helping on-call engineers a lot; basically it means fewer sleepless nights, less burnout, and more trust in the overall on-call process. So definitely it is helping a lot.

Mandi Walls (19:24): Yeah, that's an interesting thing you said there: more trust in the on-call process. We haven't really explored that too much. It's kind of a foregone conclusion that, yeah, we're going to have some kind of on-call, because we have these 24/7/365 services that we have to maintain and deal with, and things happen, and sometimes it feels like, why are we doing this to ourselves? Oh my goodness. But now we're finally at a point again where we've got maybe enough tooling to really make a difference in the impact of that burden. For a while it felt like we were kind of in a trough, where we were relying on humans quite a bit and the tools weren't really catching up, but it feels like we're on the upswing again.

Satbir Singh (20:11): With so much advancement in AI, I think the tools are getting better and better, and I see a lot of features, I mean, we have some features that I want to use, but we don't have licenses for them yet.

Mandi Walls (20:30): Oh, okay. That’s always a bummer.

Satbir Singh (20:33): Yeah, I would love to use those features because they seem really exciting.

Mandi Walls (20:39): Yeah, definitely. There's so much stuff moving around. So for the last little bit: I was just at Kansas City DevOpsDays and the Kansas City Developer Conference, and I gave a couple of talks and asked the room, how many of you are doing post-incident reviews or postmortems or just any kind of learning from incidents? And I was so disappointed in how few people raised their hands. There were very, very few. I was like, everybody in this room has homework, because you are missing out on a lot of learnings by not doing this kind of stuff. How do we relate what we learn in observability to what we can take downstream for ourselves and talk about and learn from in our post-incident reviews? Is there stuff that we can be super conscious about for teaching everybody else, maybe when our service has a blip that we learned about from the observability tools? Are there any other tricks there that you've seen folks do?

Satbir Singh (21:44): So you’re saying what are the points we need to discuss in postmortems?

Mandi Walls (21:49): Sure. Is there stuff that maybe changes as part of our practices? Is there a place where we're reinforcing what we learn in observability, where the practices sort of feed back into each other as we learn stuff and then talk about it in a post-incident review?

Satbir Singh (22:05): A lot of times observability gives us pointers to what the problem was and what the root cause was, and over time it can generate a report showing the common areas where we are having problems. It basically gives us pointers to areas where there is scope for improvement in our services, whether that is allocating more infrastructure for those services if the issues are due to hardware, or improving coding standards that were not up to the mark when those services were written. Once we get that kind of postmortem from an observability tool, it can help point us in the right direction on where we need to focus in our overall development process.

Mandi Walls (23:08): Cool. So there’s a lot of things that we can learn across the life cycle of the service there. For folks who want to get into these kinds of practices or want to start adding observability to their systems, do you have any recommendations or pointers for how they can get started or what to think about?

Satbir Singh (23:28): I think we should start small: take one critical service and instrument it with something like OpenTelemetry, gather logs, metrics, and traces around it, and tie that into our incident response system. Then we can measure how it improves our overall resolution time, and if we see success there, that definitely makes it easier to expand across the organization, because it becomes our selling point; it costs money to invest in observability. So I would suggest starting small.
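For the "tie that into our incident response system" step, here is a minimal sketch using the PagerDuty Events API v2. The routing key, service name, and summary text are placeholders, and the field names should be verified against the current PagerDuty documentation and your own account setup before relying on this.

```python
# Sends a trigger event to PagerDuty's Events API v2; assumes the requests library
# is installed and that you have an Events API v2 routing key for your service.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def trigger_incident(routing_key: str, summary: str, source: str, severity: str = "error") -> None:
    """Send a trigger event so the alert lands in the incident response workflow."""
    body = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # e.g. "checkout-service p95 latency above baseline"
            "source": source,      # e.g. the host or service emitting the alert
            "severity": severity,  # one of: critical, error, warning, info
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=body, timeout=10)
    response.raise_for_status()


# Example: call this when a baseline check like the one sketched earlier flags an anomaly.
# trigger_incident("YOUR_ROUTING_KEY", "checkout-service response time anomaly", "checkout-service")
```

Wiring a single pilot service this way, fed by its own metrics and traces, is often enough to demonstrate the faster-resolution story before expanding observability across the organization.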

Mandi Walls (24:12): Yeah, definitely. That’s sort of the plan for any kind of major migrations. You start small and celebrate those wins and show everybody else and then hope they all want it as well. Right. It looks amazing and it’s going to be so helpful. Yeah, definitely. Well, very cool. This has been great. Anything else you’d like to share with our audience before we wrap up? Any bits of advice or insider knowledge or interesting stuff there?

Satbir Singh (24:40): I would like to say that observability isn't just about tools; I think it's more about people. It gives engineers the ability to ask questions about their system, reduce downtime, and build resilience into the system. So basically, observability turns firefighting into continuous learning and improvement. That's my understanding, and that's my takeaway message.

Mandi Walls (25:12): Yeah, that’s perfect. Yeah, turn it into learning and understanding. Love it. That’s what we want. We want folks to enjoy their work and also be in a place where it improves over time. For sure.

Satbir Singh (25:25): Yeah,

Mandi Walls (25:25): It’s been great. Thank you so much for joining me today.

Satbir Singh (25:28): Thank you, Mandi. Thank you for your time.

Mandi Walls (25:30): Excellent. For everybody else out there, thanks for joining us. We'll be back in a couple of weeks with another episode, and in the meantime, I'll wish you all an uneventful day. That does it for another installment of Page It to the Limit. We'd like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you've heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at PageIt2theLimit, using the number two. Thank you so much for joining us, and remember: uneventful days are beautiful days.

Show Notes

Additional Resources

Guests

Satbir Singh

Satbir Singh (he/him)

“Satbir Singh (he/him) is a Technical Consulting Engineer at Cisco Systems, Inc., dedicated to helping organizations design, implement, and optimize complex networking and infrastructure solutions. With a knack for turning daunting technical challenges into streamlined successes, Satbir combines deep expertise in network automation, troubleshooting, and performance tuning with a genuine enthusiasm for collaborative problem-solving.

Outside of the data center and dashboard, Satbir brings the same thoughtful energy to sharing knowledge—whether that’s mentoring up-and-coming engineers or sparking ideas in technical discussions. And, yes, he never leaves a tool behind—those routers have pride of place in the tool chest, always ready for action.”

Hosts

Mandi Walls

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.