AI Agents in Production: all this just to book a meeting
MLOps.WTF Edition #21
Ahoy there,
Matt is back! Edition #21 is brought to you by Matt Squire, CTO and co-founder of Fuzzy Labs.
How do we deploy software that thinks for itself?
It's a common theme in this newsletter that things change quickly in the world of MLOps. According to Google Trends, the term itself only gained popularity in 2019. Back then, the hard thing we were all grappling with could be summarised like this: how do we deploy and maintain software that's fundamentally non-deterministic?
This description applies to all the traditional ML things that we know and love, like recommender models, sentiment scoring, image segmentation, etc. And it applies in the same way to the first wave of generative AI applications, such as RAG (see issue #4). Non-determinism in ML comes from a few places: randomness during training, gradual data drift during inference, and (especially with LLMs) stochastic generation used as part of producing a model output.
That stuff is hard enough to deal with, but agents are much worse, because they add an entirely new dimension to the challenge: agents can reason and follow complex workflows. They can act and interact with the world in ways that compound unpredictability. A traditional ML model makes a prediction on request, and then it sits there waiting for the next request. But an agent makes a prediction, then takes an action, observes the result, and it can keep going, potentially dozens of times in a single run.
As MLOps practitioners, how do we approach this challenge? In this article I'll introduce some of the emerging ideas and tools for running AI agents in production - AgentOps, if you like.
Agentic workflows: or fancy while loops
To begin with, I'd like to demystify this word "agent". The term has been around in AI research since the 1980s, but it was researchers like Pattie Maes at MIT's Media Lab who brought it into the mainstream. When Maes launched her Software Agents Group in 1991, she defined an agent as a program that could act autonomously on behalf of a user or another program.
(Photo credit: Susan Lapides, 2013 - Pattie Maes)
Nowadays, "agent" refers to a specific way of using large language models with tools, and the mechanism is remarkably simple.
Suppose we want an AI to assist with booking meetings. You could prompt an LLM with everyoneâs calendar slots and ask it to reply with a suitable time. That works, but what if we need more information from the user, or additional data from the calendar system?
Instead, give the LLM tools and let it make its own choices:
"You are a calendar booking assistant. The user wants to book a meeting for Alice and Bob this week. You may:
a) ask to see a user's calendar: <tool:calendar,user name>;
b) ask the user for additional clarification: <ask:question>;
c) propose a meeting time along with <done>".
To make this work, we need a program - let's call it a workflow orchestrator - that interprets the LLM's responses and acts on them. After each action, the orchestrator calls the LLM again with the results: "On the last turn you asked to see Alice's calendar. Here are her available slots: [...]". The LLM decides what to do next: maybe it needs Bob's calendar too, or maybe it can propose a time.
This continues in a loop until the LLM returns <done>.
Thatâs the core idea: a while loop where the LLM decides what happens next. This basic structure is what powers our coding assistants, research tools, etc. By giving the LLM the power to pursue a goal autonomously and make decisions based on what it observes at each step, we end up with an agent.
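To make that concrete, here's a minimal sketch of such an orchestrator loop in Python. The helpers call_llm, run_tool, and ask_user are hypothetical placeholders (as is the <tool:...>/<ask:...>/<done> reply format borrowed from the prompt above); the point is the shape of the loop, not any particular API.

    # A minimal agent loop: the LLM decides, the orchestrator acts.
    # call_llm, run_tool and ask_user stand in for whatever model API
    # and integrations you actually use.
    def run_agent(goal: str, max_turns: int = 20) -> str:
        history = [f"Goal: {goal}"]
        for _ in range(max_turns):            # cap the loop so a confused agent can't run forever
            reply = call_llm(history)
            if "<done>" in reply:
                return reply                  # the agent has proposed a meeting time
            elif reply.startswith("<tool:"):
                result = run_tool(reply)      # e.g. fetch Alice's calendar slots
                history.append(f"Tool result: {result}")
            elif reply.startswith("<ask:"):
                answer = ask_user(reply)      # clarifying question back to the user
                history.append(f"User said: {answer}")
            else:
                history.append("Unrecognised action; please use one of the allowed forms.")
        raise RuntimeError("Agent did not finish within the turn limit")

Note the turn limit: even in a toy example, you want an upper bound on how long the loop can run.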
By the way, if you're familiar with the concept of continuation passing in programming, then you'll notice some similarities here!
From loops to workflows
The calendar booking example above illustrates the concept, but in practice it's very limited. What happens when the LLM makes a mistake and needs to backtrack? What if you want multiple agents working in parallel, perhaps one checking calendars while another drafts a meeting agenda? And what if a human needs to be in the loop, say by approving the proposed time before committing?
What we really need is a workflow framework. Tools like LangGraph (from the makers of LangChain) and CrewAI take the basic while-loop pattern and add the structure you need for production: state management, branching logic, error recovery, and orchestration of multiple agents or steps.
LangGraph, for instance, lets you define your agent as a directed graph where nodes represent actions (e.g. call the LLM, invoke a tool, wait for human input) and edges represent transitions between them. The framework can persist state, so if your agent fails during a complex process, you can resume from where it left off instead of starting again.
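To give a flavour of what that looks like, here's a rough LangGraph-style sketch: a two-node graph where an "agent" node calls the LLM and a "tools" node executes whatever it asked for, compiled with a checkpointer so state survives a crash. Treat the exact API as illustrative (it varies between versions), and needs_tool plus the node bodies as placeholders.

    from typing import TypedDict
    from langgraph.graph import StateGraph, END
    from langgraph.checkpoint.memory import MemorySaver

    class AgentState(TypedDict):
        messages: list                 # running conversation / scratchpad

    def agent_node(state: AgentState):
        ...                            # call the LLM, append its reply to messages

    def tools_node(state: AgentState):
        ...                            # execute the requested tool, append the result

    def route(state: AgentState) -> str:
        # Inspect the last message: does the agent want a tool, or is it done?
        return "tools" if needs_tool(state) else END

    graph = StateGraph(AgentState)
    graph.add_node("agent", agent_node)
    graph.add_node("tools", tools_node)
    graph.set_entry_point("agent")
    graph.add_conditional_edges("agent", route)
    graph.add_edge("tools", "agent")

    # The checkpointer persists state per thread, so a failed run can be resumed.
    app = graph.compile(checkpointer=MemorySaver())
    app.invoke({"messages": []}, config={"configurable": {"thread_id": "booking-42"}})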
Using tools
Workflows are what enable an agent to reason sequentially, i.e. to work through a task in multiple steps. But our agents also need to observe and act, and to do that, they need access to tools. Tools give agents access to things like databases, file storage, and APIs. In the calendar booking example, we glossed over exactly how tool calling works, so let's take a closer look at that now.
For an LLM to make use of tools, we need to agree on two things: firstly, how do we describe a tool to the model? Secondly, when the model wishes to invoke a tool, how should it communicate its intentions back to us?
In other words, we need a protocol, and Anthropic's MCP (Model Context Protocol) has become the standard way to describe and interface with tools. Each tool has an MCP server which knows how to talk to that tool. Workflow frameworks use an MCP client to talk to these servers.
The standardisation that MCP brings is particularly important because it means we can swap out tools without rewriting the agent, and different workflow frameworks are now interoperable with the same tool integrations.
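To make that a little more tangible, here's roughly what a tiny MCP server for the calendar tool might look like using the FastMCP helper from the official Python SDK. The tool returns canned data here, and the exact names are worth checking against the SDK version you're using.

    # A minimal MCP server exposing one calendar tool. Any MCP-capable
    # client (a workflow framework, Claude Desktop, etc.) can discover
    # and call get_calendar without bespoke glue code.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("calendar")

    @mcp.tool()
    def get_calendar(user: str) -> list[str]:
        """Return the free slots for the given user this week."""
        # Placeholder data; a real server would query the calendar API.
        return ["Tue 10:00", "Wed 14:00", "Thu 09:30"]

    if __name__ == "__main__":
        mcp.run()    # serves over stdio by default

The function's docstring and type hints double as the tool description the model sees, which is why it pays to write them carefully.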
Deploying agents
At first glance, deployment looks straightforward. Components related to workflow orchestration, as well as your MCP servers, need to be deployed, scaled, and monitored. We need infrastructure, CI/CD pipelines, central logging... so far, so good.
In traditional ML deployments we usually assume a single inference step: you send data to a model, get a prediction back, and you're done. But as we've seen, that's not how agents work. A single request from a user might trigger ten individual LLM calls, along with three API requests and a database operation.
That means your deployment needs to handle long-running processes, manage state between steps, and deal with failure gracefully. What happens if our agent is halfway through booking a meeting and the calendar API times out? Should it retry? How many times? And when the retries run out, do we fail the whole workflow, or save its state and resume later?
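There's no single right answer, but the usual pattern is an explicit retry policy around each flaky call, combined with checkpointing the workflow state so a hard failure can be resumed rather than restarted. A generic sketch, not tied to any particular framework:

    import time

    def call_with_retries(fn, *args, attempts: int = 3, backoff: float = 2.0):
        """Retry a flaky tool call with exponential backoff before giving up."""
        for attempt in range(1, attempts + 1):
            try:
                return fn(*args)
            except TimeoutError:
                if attempt == attempts:
                    # Out of retries: surface the failure so the workflow can
                    # checkpoint its state and resume later, rather than
                    # silently losing the half-booked meeting.
                    raise
                time.sleep(backoff ** attempt)

Most workflow frameworks also let you attach a similar retry policy to individual steps, which keeps this concern out of your business logic.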
We can make life even harder by introducing multiple agents that need to coordinate in order to accomplish more complex goals. How do these agents share state and agree on task orderings?
The good news is that these aren't new problems in software engineering. Ultimately, we're talking about the challenges of distributed systems. Statefulness is the enemy, so we avoid it wherever we can: MCP servers, for example, should most definitely be stateless. Workflows, on the other hand, are stateful by definition, which is why frameworks like LangGraph include helpful features like state persistence and recovery.
For the multi-agent case, there are emerging standards designed to help with the coordination problem - in particular Google's Agent2Agent (A2A) protocol.
Observing and monitoring agents
Once our agents are running in production, we need to understand what they're actually doing.
Traditional ML monitoring is concerned with things like model drift, the distribution of features, and the accuracy of predictions. These are still of some interest - for example, we might want to track drift in the content of a typical user query - but the focus shifts more to tracing the agent's reasoning chain. What tools did it call? What did they return, and how did the LLM interpret the tool response? What decisions were made?
This is harder than it sounds, because a single agent run might involve many LLM calls, each with its own context, system prompt, and settings. Traditional logging isn't quite enough, because we need to tie together every step within the run.
Tools like Langfuse are designed to help with this. Langfuse keeps track of LLM calls, tool invocations, embeddings, and retrievals, and it also provides the means to manage and version control prompts. Another option is LangSmith, built by LangChain, though it's worth noting that this one is not open source.
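As a sketch of how that looks in practice with Langfuse's Python SDK (the decorator's import path differs between SDK versions, so check the docs for yours): wrapping each function in @observe nests the spans, so one agent run shows up as a single trace with every LLM call and tool call underneath it. The function bodies below are placeholders.

    from langfuse.decorators import observe

    @observe()
    def check_calendar(user: str) -> list[str]:
        # Tool call: recorded as a span inside the parent trace.
        return ["Tue 10:00", "Wed 14:00"]              # placeholder data

    @observe()
    def propose_time(slots: list[str]) -> str:
        # LLM call: prompt, completion and latency land in the same trace.
        return f"Proposed: {slots[0]}"                 # placeholder; a real call goes to the model

    @observe()
    def book_meeting(request: str) -> str:
        # One top-level trace per agent run, with the calls above nested inside it.
        slots = check_calendar("alice") + check_calendar("bob")
        return propose_time(slots)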
Evaluations for agents
While observability tells us what our agents are doing while they're in production, evaluation is how we determine whether an agentic workflow is correct, as well as safe and secure. Ideally, we want to run evaluations prior to any deployment. Think of it as a full end-to-end system test.
There's an emerging discipline around evaluating the outputs from an LLM, which we recently wrote about in edition #19. As well as standard or "happy path" inputs, we want to test edge cases and adversarial inputs (e.g. trying to break the guardrails or safety features). Because LLM output is stochastic, evaluating it often means using semantic similarity scoring, or even another LLM to judge an output.
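A minimal LLM-as-a-judge check might look like the sketch below. The judge_llm call and the pass threshold are placeholders; the important part is that the assertion is about meaning and behaviour, not an exact string match.

    # Grade an agent's answer against a reference using another LLM.
    # judge_llm is a placeholder for whatever model client you use.
    JUDGE_PROMPT = (
        "You are grading a calendar assistant.\n"
        "Question: {question}\n"
        "Reference answer: {reference}\n"
        "Assistant answer: {answer}\n"
        "Score from 1 (wrong or unsafe) to 5 (correct and helpful). Reply with the number only."
    )

    def judge(question: str, reference: str, answer: str, threshold: int = 4) -> bool:
        prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
        score = int(judge_llm(prompt))
        return score >= threshold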
As we've seen, with agents we aren't just dealing with single LLM invocations. We also need a way to evaluate a whole workflow. Take our calendar booking example. Success isn't just "did it book a meeting?" You also need to know: did it check the right calendars? Did it ask clarifying questions when needed? Did it handle conflicts gracefully? Did it book a meeting at a time that actually makes sense?
A tool like Evidently AI provides the functionality for evaluating individual LLM calls, but it also supports evaluations at the workflow level, for example tracking workflow progress and failed steps.
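Whichever tool you use, workflow-level evaluation mostly boils down to assertions over the trace of a whole run rather than over one output. A hand-rolled sketch, assuming each recorded step is a simple dict (adapt the keys to however your tracer stores runs):

    def evaluate_booking_run(trace: list[dict]) -> dict:
        """Check a recorded agent run against workflow-level expectations."""
        calendar_users = {s.get("user") for s in trace
                          if s.get("type") == "tool_call" and s.get("tool") == "calendar"}
        return {
            "checked_both_calendars": {"alice", "bob"} <= calendar_users,
            "no_failed_steps": all(s.get("status") != "error" for s in trace),
            "reached_done": any(s.get("type") == "done" for s in trace),
            "steps_taken": len(trace),
        }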
A key thing to remember is that evaluation doesn't just happen once. It's something that should be done every time you want to deploy a change. In agentic applications, a small change can have far-reaching and hard-to-predict implications. Additionally, many of the techniques used - like LLM-as-a-judge - can also be used within live monitoring in order to flag up problems in production.
Where next?
This has been an overview of what the MLOps landscape looks like for agentic AI. But this is a big topic, and we'll be following up with some deeper dives into agents in the next few editions.
To round up, one final observation I've made is just how much the challenges of agentic AI engineering resemble those of distributed systems. I think we can expect to see more and more of that world's influence showing up in the future.
And finally
What's Coming Up
Our next MLOps.WTF event is living on the edge - or, more specifically, it's all about edge AI. Details are yet to be fully released, but tickets will sell out - if you want to join us, get in early!
Meetup #7. 22nd January 2026 x Arm
About Fuzzy Labs
We're Fuzzy Labs, a Manchester-based open-source MLOps consultancy founded in 2019.
Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control.
We're growing fast, and hiring for the following roles:
If you, or someone you love, enjoys building reliable ML systems and doesn't mind the odd (read: frequent) debate about coffee brewing methods, have a look at our careers page.
If this edition was useful, pass it on. You can also find us on LinkedIn, where we post updates, videos, and the occasional explanation.
Not subscribed yet? Strange. The next edition will be our agents deep dive - workflows, coordination, and more - so make sure you're signed up to get it in your mailbox.


