AI Agents in Production (Part 2): Workflows
MLOps.WTF Edition #23
Ahoy there 🚢,
This episode is brought to you by Shubham Gandhi, MLOps Engineer and run club enthusiast at Fuzzy Labs.
Last episode, Matt introduced some of the challenges teams can expect to encounter when productionising AI agents. Agentic applications fundamentally differ from traditional software and ML systems in how a request is executed end-to-end.
Rather than a single prediction, agentic systems run multi-step workflows with iterative loops of reasoning, action, and state. Agents maintain context, make decisions conditionally, and adapt their behavior as execution unfolds. A single request can cascade into many tool calls, data retrievals, and intermediate decisions. Each of these steps introduces new failure modes, dramatically expanding the surface area where things can go wrong.
Agentic applications introduce a challenge in how to observe, debug, and evaluate such systems. How do you build confidence in a system that is inherently non-deterministic? In this newsletter, I’ll share what we’ve learnt about managing agents and workflows in Fuzzy Labs’ customer work.
A workflow by any other name
But first, we need to talk about terminology. Because agentic AI is such a new field, it’s inevitable that different people will use the same words to mean subtly different things. Unfortunately, workflow has different meanings depending on who you ask.
In our previous edition, we discussed agentic workflows, and what we really meant by that was the control loop that sits behind an agent. For each loop iteration, the agent’s model is given a prompt along with some context, and it is given the opportunity to take an action — like calling a tool, or updating its memory.
In a recent article from Anthropic — Building effective agents — a workflow is defined very differently. For Anthropic, workflows and agents are mutually exclusive concepts. Quoting the article:
“Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.”
The distinguishing factor is autonomy: workflows can’t make decisions about what to do next, but agents can. For this article, we’re adopting Anthropic’s definitions.
Workflows vs agents (to be, or not to be)
LLM workflows are predictable and consistent if the task is well-defined. Some examples include categorising customer service queries, translating a document, or generating or summarising reports from a database. The patterns range from forwarding queries to an LLM and getting back a response, to more complex chaining and routing involving multiple steps, or stateful multi-turn workflows.
Agentic patterns, on the other hand, involve autonomous loops of LLM reasoning and tool usage, where the system dynamically decides what steps to take in order to achieve a goal. Some examples include a sales coaching assistant, open-ended research, or multi-tool problem solving. There is no predefined code-path; agents go in an endless loop using their available tools and knowledge to collect all the necessary information required to complete a task.
Even though agents are fashionable right now, not all LLM applications need to use an agentic pattern. There are lots of simple workflows that may solve your use case.
There are various options on how to implement these patterns. Popular libraries for implementing workflows include Langchain, LlamaIndex. For agents, we have used the open source Pydantic AI library in most of our work, but besides that there are lots to choose from such as Google ADK, LangGraph, CrewAI and Parlant.
Reproducibility and evaluation
When we deploy agents, we pay particular attention to reproducibility: for any actions and decisions made by an agent, we want the ability to look back and understand how the agent got there. Without that, debugging becomes increasingly difficult, and moreover we have no ability to explain outcomes or measure performance.
By introducing experiment tracking, we can version the prompt, dataset, models, code and various metrics. It also allows us to keep track of traces. Traces are records of all actions, messages, tool calls, reasoning, and intermediate communications, tracked across the lifecycle of a request. Traces are invaluable for debugging and provide insight into the actions an LLM takes in generating a response. The popular open source tools that we have used are MLFlow and Langfuse.
With a reproducible foundation in-place, the next critical component we need is an evaluation framework. The idea here is to perform error analysis, collect 50 - 100 examples of where the application is failing, ideally through real user conversations. If you don’t have any data, you can also generate synthetic data to get started.
These examples serve as an evaluation dataset. The evaluation process for agentic workflows can be broken down into two steps. First, we check whether the overall task was successful. Second, we perform a step-level diagnosis: checking whether tools were selected appropriately and whether the agent recovers from failures. For workflow-based patterns, error analysis needs to be targeted at each stage of the workflow. For more advanced cases, LLM-as-judge evaluators can also be included. There are generic LLM-specific evaluation tools such as DeepEval, Deepchecks and Ragas.
To summarise, by this point we have an orchestrator, an application-specific evaluator, and an LLM-specific experiment tracker with tracing for debugging. Together, these enable us to confidently iterate and improve the performance of the agentic application. Because LLMs are susceptible to hallucinations and prompt injection, one common outcome of evaluation is the addition of guardrails around inputs and outputs to catch and flag issues early.
Production considerations
In this article, we’ve discussed some of the fundamentals of MLOps as they apply to tracing, reproducibility, and evaluation for agentic systems.
There are plenty of other considerations for productionising agentic applications. Defining clear success metrics at the start of a project is important if we want to meaningfully evaluate performance - the frameworks don’t do the thinking for us here. As a project evolves, we also need to consider increased complexity in observability and telemetry, alongside more sophisticated guardrails and safety controls. On top of that, the standard set of application monitoring still applies.
What’s next? (all the world’s a stage)
Over the next few editions, we’re going to dive into some of the most important topics in agentic AI and AgentOps. We’ll cover multi-agent systems and agent-to-agent protocols, explore evaluation and testing in greater depth, look at fully self-hosted agentic applications, and cover safety and security — which may turn out to be the most important emerging topic in this field.
Agents are still very new technology, and we’re constantly learning and refining our approach to AgentOps. We’re keen to hear your own experiences and lessons learned, so please get in touch and let us know.
Shubham, (the perfect dude) is a master of AI with a passion for machine learning engineering and MLOps. He holds a Master’s degree in AI, enjoys running, and believes the best solutions are usually the simplest ones.
And finally
What’s coming up
Our next MLOps.WTF meetup is happening on 22nd January 2026, hosted by Arm - and it’s now sold out!
If you’ve got a ticket but can no longer make it, please cancel so someone on the waiting list can take your place. And if you missed out, it’s still worth joining the waiting list… just in case.
This one’s an edge AI special, focused on what actually changes when models move out of the cloud and into the real world: tighter constraints, harder debugging, and failure modes you don’t see coming until you ship. We’ll be hearing practical stories from Arm, Fotenix, and Fuzzy Labs on what it takes to run edge AI systems day to day.
🗓️ Thursday 22nd January 2026 — Manchester
We’re also headed to our first BIG event, AI & Big Data Global on 3-4 February. Matt will be joining a panel at the conference, digging into what it really takes to take AI systems from prototype to production.
🗓️ 4–5 Feb 2026 — Olympia London
If you’ll be there, come say hello, and if you show this newsletter, we’ll even give you a bottle of sauce. Secret password: IReadTheNewsletterUntilTheEnd.
About Fuzzy Labs
We’re Fuzzy Labs, a Manchester-based MLOps consultancy founded in 2019. We’re engineers at heart, and nerds that are passionate about the power of open source.
And right now, we really are hiring. We’re growing fast — and we’re on the lookout for people to join our team.
Open roles:
If you, or someone you know, want to build serious systems with people who can happily spend 30 minutes arguing about observability and espresso extraction, we’d love to hear from you.
Not subscribed yet? You probably should be. The next issue will be our MLOps.WTF meetup playback and then, after that we’ll be diving deeper into agents in production, starting with multi-agent systems. Or follow us on LinkedIn to see what we’re up to🫶.

