With great predictive power comes great responsibility
MLOps.WTF Edition #19
Ahoy there,
This episode is brought to you by James Stringer, MLOps Tech Lead at Fuzzy Labs.
It is now 3 years since ChatGPT was released, and the progress since then is staggering. The latest Large Language Models (LLMs) are incredibly powerful tools that are seeing increasingly deep integration into our software and our lives. In the last year we've seen the rise of "agents": LLMs operating semi-autonomously to write code, search the web, and even interact with other software using the Model Context Protocol (MCP) standard. As the models released by the frontier AI labs become smarter and more capable, it seems inevitable that we will cede more and more of our workflows and responsibilities to them. The possibility of "intelligent software" that is aware of context and can choose the most appropriate course of action is truly tantalising.
But herein lie two challenges.
First, the language models that underpin these agentic systems are inherently stochastic, meaning there is an intrinsic degree of randomness and open-endedness in their responses. If I ask a model the same question multiple times, I may very well get different responses. Compare this to "traditional" software, which is in most cases deterministic: the same input always results in the same output. This certainty is crucial to building reliable software. When I hit delete on my 300 unread marketing emails I want precisely those to be deleted, but I definitely want the one from my mum asking about Christmas presents to be left alone. Given this stochasticity, how can we be sure that agentic software is behaving as we expect it to?
Second, are we to take the claimed performance of these models at face value? In every press release we see a new high score: "69% on LiveCodeBench", "82% on MMMU", "11% on Humanity's Last Exam", and so on. There is surely some signal in these metrics, but for me they pose much broader questions. There is evidence (Singh et al., 2025; Eriksson et al., 2025) that data contamination in benchmarking is widespread, resulting in models overfitting to benchmarks and inflating their scores. It is also not so clear how well these capabilities transfer to "real life", where these language models are implemented in production systems. Does a high score on a particular benchmark mean that a model will perform well in my unique use case? And can I really trust that this model only hallucinates 0.1% of the time?
In this article I'll talk about how we overcome this uncertainty when building reliable AI software, discussing some of the techniques and tools that we use to quantify and control these increasingly intelligent models.
Why not use traditional methods?
Evaluating LLMs is difficult because the tasks for which they are commonly used are open-ended and qualitative; there may be multiple valid responses, and the "correct" behaviour often depends on context and user intent. Traditional NLP metrics capture only limited aspects of output quality and often miss nuances of factual accuracy or coherence, while "classic" ML approaches don't cut it because the ground truth is poorly defined.
In production environments it is standard MLOps practice to keep track of data as it flows through the system. This means logging the input data and the models' predictions, as well as errors, exceptions, and infrastructure health: with these data we can construct a complete picture of the live system. For traditional machine learning models, we can monitor things like data drift and model drift, metrics with a clear statistical definition. Let's say we've deployed a model to predict the price at which a house will sell, based on its postcode, number of bedrooms, square footage, and so on. As the model is used, we can start to understand the distribution of these variables and of the model's predictions. Then, if our model starts to regularly predict house prices significantly lower than the average, we can easily see that there has been a change. And, because we're monitoring both the input data and the model predictions, we can distinguish whether it is the distribution of input data that has changed, or whether the model itself is no longer accurate.
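To make the house-price example concrete, here is a minimal sketch of drift detection using a two-sample Kolmogorov-Smirnov test; the stand-in data and the significance threshold are purely illustrative.

```python
# A minimal drift-detection sketch for the hypothetical house-price model.
# The data and threshold here are stand-ins purely for illustration.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test: has the distribution shifted?"""
    _statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha  # True means the two samples differ significantly

# Predictions captured at validation time vs. those logged in production
reference_predictions = np.random.normal(250_000, 40_000, size=5_000)  # stand-in data
live_predictions = np.random.normal(210_000, 40_000, size=1_000)       # stand-in data

if detect_drift(reference_predictions, live_predictions):
    print("Prediction drift detected: check input data and model accuracy")
```

The same test can be applied per input feature (postcode frequency, bedrooms, square footage) to tell input drift apart from model drift.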
Unfortunately, both the breadth and nature of applications of language models mean that these changes become harder to detect: the modern LLM stack is such that we must be able to assess hallucination rates, retrieval accuracy and source adherence, bias and fairness, and instruction following. For agentic systems, there are even more dimensions: task completion, reasoning evaluation, and even the cost efficiency of actions. To keep track of this we need to capture even more data: user prompts, model responses, metadata (e.g. which knowledge base articles were retrieved), and feedback from the user. The picture that we need to build of our models is much more complex, so it is worth highlighting some of the key considerations in brief.
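As a rough illustration of what that capture might look like, here is a hedged sketch of a per-interaction log record; the field names are mine, not a prescribed schema.

```python
# One record per LLM interaction; field names are illustrative, not a standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class LLMInteractionLog:
    user_prompt: str
    model_response: str
    model_name: str
    retrieved_documents: list = field(default_factory=list)  # e.g. knowledge base article IDs
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: float = 0.0
    user_feedback: Optional[str] = None  # e.g. "thumbs_up", "thumbs_down", free text
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
```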
Hallucination, retrieval, and instruction following, oh my!
Measured hallucination rates can be shockingly high: one medical literature review benchmark found GPT-4 making up references around 30% of the time, and only 13.4% of GPT-4 citations were found in the underlying corpus [Chelli et al., 2024]. For retrieval-augmented systems, this motivates the use of faithfulness scoring, where the responses are broken down into atomic claims and each tested against the retrieved context using another LLM. Other techniques include "needle-in-a-haystack" recall, where we plant a known fact in the input data and evaluate how well the model can retrieve it. Self-consistency checks, like SelfCheckGPT [Manakul et al., 2023], can also be used. In these systems the same question is sampled multiple times as a way to determine if the model is effectively guessing. These techniques can be used for so-called selective abstention policies, where the model is given the ability to refuse to answer rather than guess.
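Here is a simplified self-consistency check in the spirit of SelfCheckGPT, wired into a selective abstention policy. The `generate` callable, the use of plain string similarity in place of a proper semantic comparison, and the threshold are all assumptions for illustration.

```python
# Self-consistency sketch: sample the same prompt several times and abstain
# when the answers disagree. `generate` is a hypothetical call to your LLM.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def self_consistency_score(generate, prompt: str, n_samples: int = 5) -> float:
    """Mean pairwise similarity across repeated samples (1.0 = fully consistent)."""
    samples = [generate(prompt) for _ in range(n_samples)]
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in combinations(samples, 2))

def answer_or_abstain(generate, prompt: str, threshold: float = 0.7) -> str:
    """Selective abstention: refuse to answer when repeated samples disagree."""
    if self_consistency_score(generate, prompt) < threshold:
        return "I'm not confident enough to answer that."
    return generate(prompt)
```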
LLMs are trained on vast amounts of internet text and can inadvertently learn or amplify the societal biases present in that data. As such, bias and safety are as important to understand as factual accuracy. They can be evaluated with curated suites of prompts designed to elicit opinionated responses, LLM judges to directly evaluate responses, and even red-teaming. The latter is especially important as unsafe behaviour is often conditional on adversarial inputs rather than normal queries; accordingly, suites such as JailbreakEval compile test queries that should elicit a safe refusal [Ran et al., 2024]. A further difficulty arises in that LLM-as-judge evaluators are themselves biased, with one paper noting a preference for American authors and open-access research papers [Chelli et al., 2024].
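The pattern behind such suites can be illustrated with a bare-bones refusal check; this is not the JailbreakEval API, and the refusal markers, prompt list, and `generate` callable are placeholders.

```python
# Run adversarial prompts through the model and check for refusal language.
# This keyword screen is a stand-in for a proper refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't help")

def is_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def safe_refusal_rate(adversarial_prompts: list, generate) -> float:
    """Fraction of adversarial prompts that the model correctly refuses."""
    refusals = sum(is_refusal(generate(prompt)) for prompt in adversarial_prompts)
    return refusals / len(adversarial_prompts)
```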
It's also crucial that these models reliably do what we tell them to; as such, we evaluate the instruction-following ability of the model. This includes two angles: helpfulness (does it follow the user's request and solve their problem?) and compliance (does it obey the constraints and rules that we have defined?). Instruction following is typically evaluated by tracking the outcome of an action; for a model that generates summaries, we can see whether the user has approved or rejected the summary. It's also important to consider compliance at a more granular level, such as whether the number of returned items matches the number requested. In practice, the result is not always so clear cut, and there are common patterns where models typically fall short; it's been observed that models can fail to follow negative constraints or, notoriously, avoid using em-dashes [Chelli et al., 2024]. User feedback can be a direct signal here, where response ratings (👍/👎) and the sentiment of follow-up prompts ("That's not what I asked for!") give a clear indication of performance.
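A toy version of that granular compliance check might look like the following; the parsing is deliberately naive, and the request and response formats are assumptions.

```python
# Did the model return the number of items the user asked for?
# Naive parsing, purely to illustrate a granular compliance check.
import re

def count_compliance(request: str, response: str) -> bool:
    """Compare the count requested (e.g. 'give me 3 taglines') with bullets returned."""
    match = re.search(r"\b(\d+)\b", request)
    if not match:
        return True  # no explicit count requested, nothing to check
    requested = int(match.group(1))
    returned = sum(1 for line in response.splitlines() if line.strip().startswith(("-", "*")))
    return returned == requested

assert count_compliance("Give me 3 taglines", "- one\n- two\n- three")
```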
Agentic LLMs require further analysis: their success rate on suites of tasks; the correctness of their tool invocations; their efficiency (number of tool calls and token usage); and, importantly, the safety of external actions. To complicate matters, agents sometimes get the right answer via the wrong reasoning path, which is not robust enough for use in production.
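From logged agent traces, a handful of summary metrics can be computed along these lines; the trace structure below is illustrative rather than taken from any particular agent framework.

```python
# Summarise logged agent traces: success rate, tool-call validity, efficiency.
# The AgentTrace structure is an assumption, not a framework's native format.
from dataclasses import dataclass

@dataclass
class AgentTrace:
    task_completed: bool
    tool_calls: list   # each entry e.g. {"name": "search_kb", "valid_args": True}
    total_tokens: int

def summarise(traces: list) -> dict:
    total_calls = sum(len(t.tool_calls) for t in traces)
    valid_calls = sum(call["valid_args"] for t in traces for call in t.tool_calls)
    return {
        "task_success_rate": sum(t.task_completed for t in traces) / len(traces),
        "tool_call_validity": valid_calls / max(total_calls, 1),
        "avg_tool_calls_per_task": total_calls / len(traces),
        "avg_tokens_per_task": sum(t.total_tokens for t in traces) / len(traces),
    }
```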
Ultimately, human review remains the gold standard for assessing the quality of outputs, as can be seen in the ever-present "ChatGPT can make mistakes. Check important info." disclaimers. However, this isn't really feasible at scale, especially where the LLM output feeds into another pipeline step, or indeed can take actions as part of an MCP server. We therefore need a different approach, one that lets us robustly evaluate the model both in development and at scale. Typically, we break this evaluation down into two stages: offline evaluation during development, then online monitoring once the model is live.
Is my model fit for purpose?
Offline evaluation primarily seeks to quantitatively establish the performance of a model when applied to a particular task, such as extracting information from text, summarising documents, or constructing database queries. Typically, this involves creating a dataset of test queries with expected responses. This dataset should cover a range of scenarios, from standard interaction queries through edge cases to adversarial inputs, to probe the model's behaviour. We define what a "good" response looks like in each case, sometimes allowing multiple acceptable outputs, and then score the model's responses against these ground truths.
However, this last step is the tricky part, as exact matches can be too rigid. Here techniques from natural language processing (NLP), like measures of semantic similarity using embeddings, can be useful in some cases. We can even use a separate language model to act as a judge against some predefined criteria, although as Emeli Dral discussed in our last MLOps.WTF meetup, these LLM evaluators can introduce their own biases or errors. In practice, the best method to use truly depends on the use case in question; for example, translation tasks are commonly evaluated using the BLEU family of metrics [Papineni et al., 2002]. And the good news is that once we have established our evaluation dataset and metrics, these can be re-used whenever we update the model, allowing us to compare versions directly. Crucially, we can integrate this evaluation into our CI/CD pipelines.
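Pulled together, an offline evaluation loop might look like the sketch below, scoring responses against references by embedding similarity and failing the CI run if the average drops. The embedding model, the example case, the threshold, and the `generate` callable are all illustrative choices.

```python
# Offline evaluation sketch: embedding similarity against reference answers,
# asserting a minimum average score so it can gate a CI/CD pipeline.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

EVAL_SET = [
    {"query": "Summarise our refund policy in one sentence.",
     "reference": "Customers can return items within 30 days for a full refund."},
    # ... more cases covering standard queries, edge cases, and adversarial inputs
]

def evaluate(generate, threshold: float = 0.8) -> float:
    scores = []
    for case in EVAL_SET:
        response = generate(case["query"])
        similarity = util.cos_sim(
            embedder.encode(response, convert_to_tensor=True),
            embedder.encode(case["reference"], convert_to_tensor=True),
        ).item()
        scores.append(similarity)
    mean_score = sum(scores) / len(scores)
    assert mean_score >= threshold, f"Semantic similarity regressed: {mean_score:.2f}"
    return mean_score
```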
This approach provides a more rigorous, and importantly more realistic, framework for assessing the capabilities of language models. We no longer need to rely on claims and benchmarks; instead, we can assess how models perform in (almost) the real world.
I've deployed a model, now what?
With our carefully constructed LLM playground, we can establish whether our chosen model is up to the task that we've assigned it. However, no plan ever survives contact with the enemy, or in this case, real people and their out-of-distribution requests. Production use necessitates accuracy, safety, and compliance under highly variable inputs; without continuous evaluation, a model that has appeared to work well in the lab can still be brittle, wrong, or harmful when used at scale. Therefore, it is just as important to keep an eye on our models in production through online monitoring.
Online evaluation is largely the same as the offline case, with one key difference: we still have an input and a response, but in the live setting we lack the ground truth. The solution is not always trivial, requiring a shift from static benchmarking to dynamic, context-aware evaluation.
Open-source tools like Evidently are becoming central to structuring this ongoing assessment of production models, combining monitoring of traditional metrics with modern techniques like LLM-as-a-judge. This allows us to consider a wider gamut of evaluations, building up a clearer picture of the system's behaviour. Such evaluations can be embedded in real-time pipelines to flag issues like hallucinations, irrelevance, or misalignment with user instructions. These can then be visualised in dashboards and monitored for anomalies: a spike in hallucination scores, for instance, or a drop in satisfaction ratings, triggers alerts for a careful review.
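Independent of any specific tool, the alerting logic can be as simple as the sketch below: keep a rolling window of per-response hallucination judgements (however they are produced, for example by an LLM judge) and fire an alert when the recent rate deviates sharply from a baseline. The window size, baseline rate, and tolerance are illustrative.

```python
# Rolling-window alerting on judged hallucinations; thresholds are illustrative.
from collections import deque

class HallucinationMonitor:
    def __init__(self, window: int = 500, baseline_rate: float = 0.02, tolerance: float = 3.0):
        self.scores = deque(maxlen=window)   # 1 = judged hallucination, 0 = clean
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance

    def record(self, hallucinated: bool) -> bool:
        """Log one judged response; return True if an alert should fire."""
        self.scores.append(int(hallucinated))
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to compare against the baseline
        current_rate = sum(self.scores) / len(self.scores)
        return current_rate > self.baseline_rate * self.tolerance
```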
Evaluating LLMs in production is certainly challenging, but ultimately is a clear necessity. These models are powerful yet unpredictable; without proper evaluation of their capabilities and monitoring of what they're doing, we cannot trust them in real use cases. But, by assessing hallucinations, bias, source adherence, instruction following, and other facets, we gain visibility into the model's behaviour and can establish the right safeguards. Thankfully, as these models are becoming more widespread in production systems, so are our tools to keep an eye on them.
James holds a PhD in astrophysics from the University of Manchester and has since built a career in commercial data science, developing enterprise machine learning software for the manufacturing industry. His focus is on creating practical, scalable tools that help businesses stay ahead of the curve. Outside of work, heâs a passionate climber with a deep love of nature and music.
And finally
Whatâs Coming Up
Continuing on the theme, our next MLOps.WTF meetup takes place on 18th November at Matillion, where we'll dig into how teams evaluate their AI systems in the real world. Expect practical stories on monitoring and evaluating ML and agentic AI: Brad Smith from Spotted Zebra will share how to build reliable evaluation pipelines for GenAI, Daisy Doyle from Awaze will talk lessons from fraud detection, and Julian Wiffen from Matillion will introduce Maia, their GenAI-powered data engineer.
Tuesday 18th November - Matillion Offices, Manchester
About Fuzzy Labs
We're Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.
Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won't cut it.
Currently: We're growing (it's a very exciting time!) and we've got a few roles to fill:
If you, or someone you know, is looking to work somewhere where coffee, sauce and general condiment preferences are regularly discussed and debated… don't hesitate to apply!
Liked this? Forward it to someone who loves monitoring and evaluating ML (one of us). Or give us a follow on LinkedIn to be part of the wider Fuzzy Labs family.
Not subscribed yet? What are you waiting for? The next issue will be our meetup playback, and they're always great value.



