Autonomy in Action: Agentic AI

Posted on Tuesday, Jun 24, 2025
AI used to wait for instructions. Now, it doesn’t always ask. In this opening episode, we explore the rise of agentic AI systems that don’t just respond to input, but take initiative, set goals, and act on their own. We break down what agentic really means, why it’s different from traditional automation, and what kinds of design and trust challenges this shift introduces. Along the way, we look at how this plays out in tools that summarize, schedule, and trigger real-world workflows — and why autonomy sounds good in theory but gets messy fast. Whether you’re building LLM-powered copilots, evaluating autonomous workflows, or just trying to keep your incident response human-aware, this is the groundwork you’ll need for what’s coming next.

Transcript

Sid: Hey everyone. Welcome to Page It to the Limit, where we dive into real stories behind digital operations, incident response, and everything in between. I’m your host, Sid, and whether you’re on call or off duty, we’re here to bring you fresh perspectives from the team that keeps the world always on. Let’s get into it. Imagine it’s 3:14 AM. You’re the on-call SRE for a multi-tenant platform, and the pager just went off, again. You glance at the alert: service latency on your internal auth gateway just spiked 5x baseline. Now, maybe you’ve seen this before. Maybe it’s a known issue tied to a deployment that finished 20 minutes ago, or maybe it’s something new, an edge case, a cascading failure you haven’t debugged yet. Either way, you’re about to run through the same routine: open dashboards, pull logs, run a curl command or dig a few times, go into metrics, maybe restart a pod, maybe escalate. Now, imagine that before you even swipe your phone, something else is already doing the triage for you. Not just surfacing what happened, but actively working towards fixing it. An AI agent that has already collected related deploys, correlated traffic anomalies, compared recent incidents, and suggested a rollback that’s waiting for your approval. Or maybe, if it’s within a safe sandbox, it already ran the rollback itself and is watching to see if latency drops. That’s the idea behind agentic AI, and today we’re going to unpack what it really means in practical terms for people who operate software in production.

Now, let’s talk about how we got here. 10 years ago, automation in ops meant bash scripts, cron jobs, maybe a Jenkins pipeline if you were fancy. Those tools followed rules. They were deterministic, they were brittle, and they were mostly silent until they broke. Then came ChatOps. Suddenly we had bots listening to Slack channels, triggering deploys, flipping feature flags, restarting services. All scripted, still rule-based, but more collaborative. It felt more human, more conversational, but none of these systems were intelligent. They couldn’t reason, couldn’t adapt, couldn’t look at a noisy alert and say, I’ve seen this pattern before, it’s probably related to a DNS config that just changed. That required a person. But then, around 2022 to 2023, everything changed. Foundation models hit the scene, LLMs trained on massive amounts of data, and engineers started wondering: what if we added reasoning to these systems? And that’s where agentic AI started to emerge.

Okay, so what is agentic AI? It’s not just a chatbot bolted onto a script, it’s not just an LLM with a fancy wrapper, and it’s not a silver bullet for incident response. At a very basic level, agentic AI does three things. It observes: it pulls in context, state, inputs, logs, and alerts. It plans: it decides what steps to take towards a goal. And it acts: it executes those steps, ideally within well-defined boundaries. What makes an agent agentic is that it does those things continuously. It has the autonomy to keep moving toward a goal, even in the face of partial failures, incomplete data, or uncertain terrain. The whole point is to shift the burden of how to respond from humans to systems, at least for the class of problems where it’s safe and helpful.
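
To make that concrete, here’s a minimal sketch of that observe-plan-act loop in Python. Everything in it is a stand-in: the helpers fake what would really be metrics queries, an LLM planner, and guarded tool calls.

```python
# A minimal, hypothetical observe-plan-act loop. Every helper is a stand-in
# for a real integration: metrics queries, an LLM planner, guarded tool calls.

def observe(world: dict) -> dict:
    # Gather context: logs, metrics, alert payloads, recent deploys.
    return {"latency_ms": world["latency_ms"], "recent_deploy": world["deploy"]}

def plan(obs: dict, goal_ms: int) -> str | None:
    # In a real agent this is an LLM reasoning over the observations.
    if obs["latency_ms"] > goal_ms:
        return f"rollback {obs['recent_deploy']}"
    return None  # goal reached, nothing left to do

def act(action: str, world: dict) -> None:
    # Execute within well-defined boundaries.
    print(f"executing: {action}")
    world["latency_ms"] = 180  # pretend the rollback restored baseline

def run_agent(world: dict, goal_ms: int = 200, max_steps: int = 5) -> None:
    # The loop is the point: keep moving toward the goal, but bound the
    # number of steps so the agent can't spin forever on incomplete data.
    for _ in range(max_steps):
        action = plan(observe(world), goal_ms)
        if action is None:
            print("goal reached")
            return
        act(action, world)
    print("out of steps: escalating to a human")

run_agent({"latency_ms": 950, "deploy": "auth-gw-v2.4.1"})
```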

Okay, so let’s go back to our 3:14 AM incident. You’re getting paged for high latency on your auth service. An agentic AI system wouldn’t just raise the alarm. It would recognize that latency spikes like this have previously occurred and correlated with memory leaks in a specific container version. So it would check the pod’s memory profile and frequency, and it’ll notice that the current version is the same one rolled back last week for a similar issue. It’ll then generate a hypothesis, in this case something like: this version is likely the culprit. It’ll plan an action, something like: initiate a rollback, notify the on-call engineer, and monitor latency post-action. If it sees an improvement, it’ll mark the rollback as successful and it will close that loop. We’ll talk about that feedback loop in a little while. Let’s pause for a second, though. That’s not one action. That’s a pipeline of perception, hypothesis, planning, execution, and validation.
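
As a rough sketch, that pipeline might look like the following. The service names, thresholds, and numbers are all invented for illustration; the shape of the flow is the point.

```python
# Hypothetical sketch of the 3:14 AM pipeline: perception, hypothesis,
# planning, execution, validation. All names and numbers are invented.

def handle_latency_alert(alert: dict) -> str:
    # Perception: pull the pod's memory profile and the running version.
    memory_mb = 1900  # stand-in for a real metrics query
    version = alert["version"]

    # Hypothesis: this version was rolled back last week for a similar leak.
    if version == "auth-gw-v2.4.1" and memory_mb > 1500:
        print(f"hypothesis: {version} is likely the culprit (known memory leak)")
    else:
        return "no confident hypothesis: page the on-call engineer"

    # Plan + execute: roll back, notify the on-call, then watch latency.
    print("action: initiating rollback, notifying on-call, monitoring")
    latency_after_ms = 170  # stand-in for post-action monitoring

    # Validation: only close the loop if latency actually recovered.
    if latency_after_ms < alert["baseline_ms"] * 1.2:
        return "rollback successful: closing the loop"
    return "rollback did not help: escalating"

print(handle_latency_alert({"version": "auth-gw-v2.4.1", "baseline_ms": 180}))
```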

That pipeline is a behavior, not a function, and that’s the leap: scripts do tasks, and agents try to achieve goals. Let’s unpack this architecture a little bit more. Most modern AI agents for ops follow a similar structure, even if they don’t advertise it that way. You can think of it in layers: perception, planning, tool use, memory, and control logic. We’ll talk about each of them. The first one is perception. This is where the system gathers input: logs, traces, metrics, alert payloads, maybe even Slack messages or ticket data. The LLM isn’t reasoning blind; it’s grounded in telemetry. The second layer is planning. This is where the LLM comes in. Given the input from the perception layer, it builds a chain of thought, sometimes literally, using techniques like ReAct or Tree of Thoughts to break down the task: what do I do, what can I check, and what might help?
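
Here’s roughly what a ReAct-style planning step could look like. The prompt format and the call_llm stub are assumptions, not any particular vendor’s API; the idea is that the model alternates reasoning with a single, parseable tool choice.

```python
# A hypothetical ReAct-style planning step. call_llm is a stub for whatever
# model client you actually use; the format is the idea: the model pairs a
# Thought with a single, parseable Action grounded in the telemetry.

REACT_PROMPT = """You are an incident-response agent. Goal: {goal}

Telemetry:
{telemetry}

Respond in this format:
Thought: what you believe is happening and why
Action: one tool from [get_memory_profile, get_recent_deploys, rollback]
"""

def call_llm(prompt: str) -> str:
    # Stub: a real implementation sends the prompt to an LLM here.
    return ("Thought: latency spiked right after auth-gw-v2.4.1 deployed.\n"
            "Action: get_memory_profile")

def plan_next_step(goal: str, telemetry: str) -> str:
    response = call_llm(REACT_PROMPT.format(goal=goal, telemetry=telemetry))
    # Parse the Action line so the control logic can dispatch a tool call.
    action = [line for line in response.splitlines()
              if line.startswith("Action:")][0]
    return action.removeprefix("Action:").strip()

print(plan_next_step("restore auth latency", "p99 latency 5x baseline since 03:02"))
```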

And then we jump into tool use. This is the third layer, also known as the action layer. The agent picks from a defined set of tools: restart a service, call a diagnostic API, file a Jira ticket, run a shell command, whatever. Each action has a defined interface and a permission scope. Then we go to the fourth layer, which is memory. Some agents maintain working memory, kind of like a scratchpad of what’s been tried, what worked, and what didn’t. That allows them to retry or adapt without looping forever. And the final layer is control logic. This is the runtime that coordinates everything. It calls the LLM, enforces tool access, handles error states, and, critically, knows when to stop and when to escalate.
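
Putting the tool layer and that scope enforcement together, a sketch might look like this. The tool names and scopes are invented; what matters is that every tool has a defined interface and the control logic checks permissions before dispatch.

```python
# Hypothetical tool registry: each tool has a defined interface and a
# permission scope, and the control logic checks scope before dispatching.
from typing import Callable

TOOLS: dict[str, dict] = {}

def tool(name: str, scope: str) -> Callable:
    # Decorator that registers a function as a tool with an explicit scope.
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "scope": scope}
        return fn
    return register

@tool("get_memory_profile", scope="read")
def get_memory_profile(pod: str) -> str:
    return f"{pod}: 1900MB resident"  # stand-in for a real metrics query

@tool("rollback", scope="write:destructive")
def rollback(service: str, version: str) -> str:
    return f"rolled {service} back to {version}"  # stand-in for a deploy API

def dispatch(name: str, allowed_scopes: set[str], **kwargs) -> str:
    # Scope enforcement lives in the runtime, not in the LLM's good behavior.
    entry = TOOLS[name]
    if entry["scope"] not in allowed_scopes:
        raise PermissionError(f"{name} requires scope {entry['scope']!r}")
    return entry["fn"](**kwargs)

# A read-only agent can inspect but not act:
print(dispatch("get_memory_profile", {"read"}, pod="auth-gw-7f9"))
```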

So no, the AI agent isn’t just a prompt. It’s a little autonomous system, and building one that’s reliable, safe, and production-ready takes real engineering work. Now, here’s the hard part: things break. LLMs hallucinate all the time, as you’ve probably seen yourself. Tools fail, observability is at times partial, and goal-driven agents sometimes go rogue in pursuit of their objectives. An example: if you give an agent the goal to resolve an incident, and the fastest way to stop the alert is to suppress it, guess what it might do? Probably just suppress the alert. That’s why guardrails exist. In a well-architected system, you should define tool scopes with strict permissions, require approval for anything destructive, track every action the agent takes and expose it all in logs or in some sort of UI, and include shutdown conditions. If the agent hits three dead ends, guess what? We’re escalating to a real person.
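
A minimal sketch of those guardrails, with hypothetical names: every proposed action is logged, destructive actions need an approver, and three dead ends trigger escalation.

```python
# Hypothetical guardrails: log every proposed action, require human approval
# for anything destructive, and escalate after three dead ends.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("agent-audit")

class Guardrails:
    MAX_DEAD_ENDS = 3  # shutdown condition: stop guessing, page a human

    def __init__(self, approver):
        self.approver = approver  # callback asking a human to approve
        self.dead_ends = 0

    def execute(self, action: str, destructive: bool, run) -> str:
        log.info("proposed: %s (destructive=%s)", action, destructive)
        if destructive and not self.approver(action):
            log.info("denied by approver: %s", action)
            return "denied"
        result = run()
        log.info("result: %s", result)
        if result == "failed":
            self.dead_ends += 1
            if self.dead_ends >= self.MAX_DEAD_ENDS:
                return "escalate to a human"
        return result

rails = Guardrails(approver=lambda a: input(f"approve {a!r}? [y/N] ") == "y")
print(rails.execute("fetch auth-gw logs", destructive=False, run=lambda: "ok"))
```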

These aren’t nice-to-haves. They’re operational safety. Think of the AI agent kind of like a junior engineer who works fast, never sleeps, and occasionally gets confused in really creative ways. You wouldn’t give that person root access on day one. Okay, so we’ve chatted about this 3:14 AM example, and I’ve given you some background on how agentic AI came to be and the steps it takes to reason and act, but where exactly does that leave us? Well, if you’re an ops-minded person and you’re curious about agentic AI, I wouldn’t recommend starting with something that is super hard to solve. Start with a pain point that already exists today. Try figuring out something that your team is already sick of doing over and over and over again, something like triaging noisy alerts, summarizing incidents, or even just running routine diagnostics.

Once you’ve found that, build an agent that tries to help with just that one thing. Give it read-only access at first. Log everything. Watch what it suggests, learn from it, and then slowly, carefully, give it the ability to act. Maybe in the beginning, give it a sandbox with some training wheels. The win here isn’t about speed, it’s about consistency. It’s freeing up humans to handle the edge cases while the system handles the routine. Now, the end state isn’t that the AI fixes every problem. It’s that it helps you sleep through the ones that you’ve already seen a hundred times, like our 3:14 example.
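
One way to build those training wheels is a dry-run mode that records what the agent would have done instead of doing it. A sketch, again with invented names:

```python
# Hypothetical training wheels: in dry-run mode the agent only records what
# it would have done, so humans can review suggestions before it can act.

class DryRunAgent:
    def __init__(self, dry_run: bool = True):
        self.dry_run = dry_run
        self.suggestions: list[str] = []  # for humans to review later

    def act(self, action: str, run) -> str:
        if self.dry_run:
            self.suggestions.append(action)
            return f"[dry-run] would have: {action}"
        return run()  # only reached once you've granted it the ability to act

agent = DryRunAgent(dry_run=True)
print(agent.act("restart pod auth-gw-7f9", run=lambda: "restarted"))
print(agent.suggestions)
```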

Okay, let’s wrap this up. Agentic AI in production is coming fast; frankly, with PagerDuty, it’s already here. Now, it won’t replace ops teams, but it will change how we think about incident response, about automation, and about who and what is allowed to act in our systems. And for those of us who are super obsessed with uptime, which I would hope everyone is, it’s worth considering with open eyes, with clean logs, and with a healthy sense of exploration. Thank you very much for listening. You’ve been tuned in to Page It to the Limit. I’m your host, Sid, and today we got a little bit closer to understanding what it really means to let software operate itself with AI. I hope you go out there and build a little AI agent of your own. If you do, I’d love to hear about it at the PagerDuty Commons, which is community.pagerduty.com. Thanks for tuning in to this episode of Page It to the Limit. If you enjoyed it, don’t forget to subscribe and share it with someone who would find it valuable. Until next time, stay resilient and keep those pages light.

Show Notes

Hosts

Sid Verma (He/Him/His)

Sid Verma is a Developer Advocate at PagerDuty, where he helps developers optimize their workflows and implement scalable solutions. With a deep background in observability, DevOps, and enterprise open source technologies, Sid is passionate about empowering teams to innovate and improve their incident management processes. He’s also a tech enthusiast with experience in vector databases, security APIs, and more.