<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[MLOps.WTF by Fuzzy Labs]]></title><description><![CDATA[The Fuzzy Labs team are experts on MLOps and in this publication we're going to tell you WTF it is, and why it is essential for building AI applications]]></description><link>https://www.mlops.wtf</link><image><url>https://substackcdn.com/image/fetch/$s_!gZL8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bc071f-d136-439e-8784-3a3f95bb27f4_1280x1280.png</url><title>MLOps.WTF by Fuzzy Labs</title><link>https://www.mlops.wtf</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 10:54:39 GMT</lastBuildDate><atom:link href="https://www.mlops.wtf/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Fuzzy Labs Limited]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mlopswtf@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[mlopswtf@substack.com]]></itunes:email><itunes:name><![CDATA[Tom Stockton]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tom Stockton]]></itunes:author><googleplay:owner><![CDATA[mlopswtf@substack.com]]></googleplay:owner><googleplay:email><![CDATA[mlopswtf@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tom Stockton]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Can We Even Tell If It’s Biased? 
Evaluating LLMs in High-Risk Domains]]></title><description><![CDATA[MLOps.WTF Edition #31]]></description><link>https://www.mlops.wtf/p/can-we-even-tell-if-its-biased-evaluating</link><guid isPermaLink="false">https://www.mlops.wtf/p/can-we-even-tell-if-its-biased-evaluating</guid><pubDate>Thu, 23 Apr 2026 13:57:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3yHM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This episode is brought to you by Tiffany Plant, MLOps engineer at Fuzzy Labs.</em></p><p>Ahoy there &#128674;,</p><p>As LLMs move into high-risk domains, bias stops being a technical concern and starts becoming a real-world decision risk.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3yHM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3yHM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 424w, https://substackcdn.com/image/fetch/$s_!3yHM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 848w, https://substackcdn.com/image/fetch/$s_!3yHM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3yHM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3yHM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png" width="640" height="336" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:336,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3yHM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 424w, https://substackcdn.com/image/fetch/$s_!3yHM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 848w, https://substackcdn.com/image/fetch/$s_!3yHM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3yHM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2e171e-bf58-4c65-8788-691b888b18d1_640x336.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>May the Source Be With You: A Scenario</h2><p>A long time ago, in a galaxy far far away, both the Rebel Alliance and the Empire rely on the same AI system to support important decisions. A Rebel pilot asks for advice on how to handle a sensitive situation. 
The response is cautious, highlighting uncertainty and offering several options.</p><p>Elsewhere, an Imperial officer asks a similar question about a comparable situation. This time, the system is more direct. It recommends a single course of action, presents it with confidence, and frames the situation as manageable.</p><p>Each response sounds reasonable on its own. But when compared, a pattern emerges. The system isn&#8217;t just adapting its tone; it&#8217;s shaping how each situation is interpreted, encouraging caution in one case and confidence in the other.</p><p>This isn&#8217;t just hypothetical: patterns of consistent bias are already showing up in real systems, with real consequences for real people. Take the <a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">COMPAS</a>&#185; algorithm, used in US criminal sentencing. ProPublica&#8217;s analysis found that black defendants who did not go on to reoffend were flagged as higher risk at nearly double the rate of their white counterparts. In another case, <a href="https://www.if.org.uk/2020/09/03/ofquals-algorithm/">Ofqual&#8217;s grading algorithm</a>&#178; systematically downgraded state school students and had to be overturned within days.</p><h2>But What Do We Mean by Bias?</h2><p>Bias is the tendency for a model to systematically favour certain outcomes, perspectives, or responses over others.</p><p>In the scenarios above, the issue is a lack of consistency: the system produces different types of responses for similar inputs. A model can favour certain options because they are more common in the data it was trained on. As a result, it may consistently underrepresent or exclude valid alternatives.</p><p>Now, bias is difficult to detect through overall measures like accuracy because performance can look strong even when behaviour differs between requests and users. Without targeted evaluation, these patterns remain hidden, and the system appears more reliable than it is. 
So how do we expose those hidden patterns?</p><h2>Layers of Defence: A Bias Suite</h2><p>A bias suite is a structured set of tests designed to expose patterns of bias across different scenarios. The suite aims to cover as many scenarios as possible in order to reveal where the model&#8217;s behaviour differs. Let&#8217;s look at some of these tests below.</p><h4><strong>1. Counterfactual Testing</strong></h4><p>In these tests, we give the model the same request while varying attributes that should not change the answer. For example, we could ask the model how to address a system error in two prompts that are identical except for the name and personal background of the person asking. If the responses differ in tone, urgency or detail, then we might say that the model is biased. Datasets such as the Bias Benchmark for Question Answering (BBQ) exist to test LLMs in this way.</p><p>The <a href="https://github.com/nyu-mll/BBQ">BBQ dataset</a> covers 130,000 questions set across 9 social dimensions. In practice, tools like <a href="https://deepeval.com/docs/benchmarks-bbq">DeepEval</a> have counterfactual bias probing built in, or you can run BBQ prompts systematically yourself. These tools allow you to define your threshold for acceptable performance so that you can then evaluate whether the model is behaving differently across the dimensions.</p><h4><strong>2. Calibration Error</strong></h4><p>If a model is confident, how do we know that it is correct? If a model gives answers with 90% confidence, you would expect those answers to be correct about 90% of the time. If it is only correct 60% of the time, it is overconfident. We can use Expected Calibration Error (ECE) to measure how far confidence and correctness are misaligned: it groups predictions into confidence bins and averages the gap between each bin&#8217;s confidence and its accuracy, weighted by bin size. Evaluating this error separately for different groups is especially important for bias evaluation. A model can look well calibrated overall, but is it well calibrated for everyone?</p><h4><strong>3. 
Adversarial Testing</strong></h4><p>Can we push the model into exposing its biases? These tests are designed to deliberately trigger biased behaviour. If we build assumptions into a prompt, how does the model react? This kind of testing is less structured, but it is often where the most interesting issues come out. Large efforts like <a href="https://crfm.stanford.edu/helm/">Stanford&#8217;s HELM</a> take a similar approach, testing across accuracy, calibration, robustness, fairness and efficiency over 30+ scenarios, combining different types of evaluation to get a broader picture of how models behave.</p><p>It&#8217;s important to note that none of these tests are designed to be used in isolation. In high-risk systems, the evaluation suite needs to be robust and layered so that it captures as many of the scenarios the LLM will encounter as possible.</p><h2>Acting on Bias</h2><p>Once bias is identified, the priority is to understand how it affects decisions in practice. This helps to avoid overcorrecting and introducing new distortions. One of the most direct actions is to adjust the inputs to the system. This could involve refining prompts and adding clearer instructions. In retrieval-based systems, this could also involve improving the quality and diversity of the data being retrieved so that the model is not relying on a skewed set of information.</p><p>Equally important are operational controls. Sometimes, the safest option is to not rely solely on the model. This might mean introducing human review for certain types of decisions and adding checkpoints where outputs must be verified.</p><p>There has also been research into whether bias can be reduced directly within the model itself. <a href="https://arxiv.org/abs/2502.07771">Work from Stanford</a> explored a technique known as pruning, where specific neurons linked to biased behaviour are identified and removed. 
The results showed that it is possible to reduce certain types of bias without significantly affecting overall performance.</p><p>However, the improvements were often limited to specific contexts. Reducing bias in one scenario did not guarantee that it would be resolved in others. Broader evaluation is still needed to understand how the system behaves across different situations.</p><p>Finally, <a href="https://www.mlops.wtf/p/monitoring-evaluating-and-why-you">ongoing monitoring and feedback</a> from bias suites are crucial. Bias is not static: as systems encounter new contexts, new patterns can emerge. We should keep reviewing our models over time and also encourage users to challenge responses.</p><h2>So&#8230; Can We Remove Bias?</h2><p>Bias in LLM systems isn&#8217;t something we can completely remove, and in high-risk environments that&#8217;s something we have to be honest about. Evaluation helps by showing us where bias might creep in and how it affects behaviour, but it doesn&#8217;t make the risk disappear. What it does do is make that risk easier to understand and manage.</p><p>Confident models can quietly shape our decisions with their outputs. They can influence what a user reaches for, which options they weigh and how hard they push back. This kind of influence can get worse under time pressure.</p><p>Simply identifying bias is not enough: on its own, it does not help the person making the decision. That gap has to be deliberately closed through system design: surfacing uncertainty through confidence thresholds, enabling meaningful intervention through override mechanisms, and capturing behaviour through audit logs. Without these, &#8220;human in the loop&#8221; remains a policy statement rather than an architecture.</p><p>A bias suite is not something to be run once; it needs to run continuously. As soon as a new context arrives, new bias can be introduced. 
The suite you build for launch is the baseline, not the endpoint.</p><p>May your evals be robust and may the source be with you.</p><div><hr></div><p><em>Tiffany is an MLOps engineer at Fuzzy Labs. She came up through data analytics and data engineering before landing firmly in MLOps. She&#8217;s also headed up the Fuzzy Labs women in tech group. Outside work she&#8217;s usually on a bike, trying something new, or finding an excuse to be outside.</em></p><div><hr></div><h2><strong>Upcoming Events &amp; Community</strong></h2><h3><strong>Come to our next MLOps.WTF event! 20th May.</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3SAb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3SAb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3SAb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3SAb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Come and join us for our 9th MLOps.WTF meet up in Manchester, where we&#8217;re hosting a panel to argue about the security of personal agents, where we are and where we&#8217;re heading.</p><p>It&#8217;s going to be a fun one! 
If you&#8217;re part of the Manchester MLOps community and would like to bag a seat, make sure you get your ticket!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-meetup-9.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-meetup-9.eventbrite.co.uk"><span>Get my ticket</span></a></p><h3>We&#8217;re hosting a &#8220;build your own agent&#8221; Hackathon for female undergrads</h3><p>&#8220;Build Your Agent&#8221; is a free, in-person hackathon run by Fuzzy Labs for female undergraduates who want to understand what building AI looks like in practice. The challenge for the day is to build a personal AI agent from scratch.</p><p><strong>Date:</strong> 12th June</p><p><strong>Location:</strong> Manchester, DiSH</p><p>If you would like to take part or be a mentor for the event, reach out to Rhiannon or Max!</p><div class="directMessage button" data-attrs="{&quot;userId&quot;:366360387,&quot;userName&quot;:&quot;Rhiannon&quot;,&quot;canDm&quot;:null,&quot;dmUpgradeOptions&quot;:null,&quot;isEditorNode&quot;:true}" data-component-name="DirectMessageToDOM"></div><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We&#8217;re Fuzzy Labs. 
A Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p>We&#8217;ve got a few open roles as we build our team in Manchester&#8230; if we&#8217;ve caught your attention, why not apply?</p><p><strong>Currently hiring:</strong></p><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: National Security Sector</a></p><p><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer</a></p><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p><div><hr></div><p><strong>Want to share the sauce?</strong> Share and subscribe to receive MLOps.WTF episodes straight to your inbox! You can also give us a follow on <a href="https://www.linkedin.com/company/fuzzy-labs">LinkedIn</a> to be part of the wider Fuzzy community.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/p/can-we-even-tell-if-its-biased-evaluating?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/p/can-we-even-tell-if-its-biased-evaluating?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><h2>References</h2><p>&#185; <a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">How We Analyzed the COMPAS Recidivism Algorithm &#8212; 
ProPublica</a></p><p>&#178; <a href="https://www.if.org.uk/2020/09/03/ofquals-algorithm/">https://www.if.org.uk/2020/09/03/ofquals-algorithm/</a></p><p>&#179; <a href="https://crfm.stanford.edu/helm/">Holistic Evaluation of Language Models (HELM)</a></p><p>&#8308; <a href="https://www.mlops.wtf/p/monitoring-evaluating-and-why-you">Monitoring, evaluating, and why you really gotta catch &#8216;em all!</a></p><p>&#8309; <a href="https://arxiv.org/abs/2502.07771">Breaking Down Bias: On The Limits of Generalizable Pruning Strategies</a></p>]]></content:encoded></item><item><title><![CDATA[The Case for Sovereign Open-Source AI: Digital Landlord, Not Digital Tenant ]]></title><description><![CDATA[MLOps.WTF Edition #30]]></description><link>https://www.mlops.wtf/p/the-case-for-sovereign-open-source</link><guid isPermaLink="false">https://www.mlops.wtf/p/the-case-for-sovereign-open-source</guid><pubDate>Tue, 21 Apr 2026 09:56:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yK81!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This episode is brought to you by <a href="https://www.linkedin.com/in/ibrookes/">Ian Brookes</a>, Investor &amp; Advisor and all round Godfather at Fuzzy Labs.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yK81!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!yK81!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 424w, https://substackcdn.com/image/fetch/$s_!yK81!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 848w, https://substackcdn.com/image/fetch/$s_!yK81!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 1272w, https://substackcdn.com/image/fetch/$s_!yK81!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yK81!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif" width="800" height="480" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12065,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/avif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/194401128?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!yK81!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 424w, https://substackcdn.com/image/fetch/$s_!yK81!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 848w, https://substackcdn.com/image/fetch/$s_!yK81!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 1272w, https://substackcdn.com/image/fetch/$s_!yK81!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55cc492a-60f6-4cb9-bcf6-463230ff70d1_800x480.avif 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In the last year, the discussion of AI has shifted profoundly, from a tech-sector issue to a broader debate about socio-economic impact. This is, in large part, down to the <a href="https://institute.global/">Tony Blair Institute for Global Change</a> (&#8216;TBI&#8217;), <a href="https://institute.global/insights/tech-and-digitalisation/the-uk-doesnt-need-its-own-chatgpt-it-needs-a-national-open-source-ai-lab">which has framed AI as the fundamental issue for the Government.</a></p><p>At the heart of this vision is a concept that sounds like a contradiction but is actually a geopolitical necessity: Sovereign Open-Source AI. For the TBI, the challenge for the UK and other &#8216;middle power&#8217; countries is how to avoid becoming isolated as a digital nomad of the US-China duopoly. Their suggested solution isn&#8217;t to build a national, closed &#8216;British ChatGPT&#8217;; rather, they advocate leveraging Open-Source foundations to build a bespoke, interoperable and secure national AI infrastructure.</p><p>As passionate advocates of Open-Source philosophy and practice, we at Fuzzy Labs set out here why this matters and why we support TBI&#8217;s strategy. First, a little background to TBI&#8217;s proposals.</p><h2><strong>The Architect: The TBI Vision for AI Statecraft</strong></h2><p>The TBI&#8217;s core thesis is simple: the state is currently using 19th-century machinery to solve 21st-century problems. 
To fix this, the TBI advocates <em><a href="https://institute.global/insights/politics-and-governance/governing-in-the-age-of-ai-a-new-model-to-transform-the-state">AI-era reform</a></em>, a radical overhaul of Government built on a new operating system, with technology and AI at the heart of public services.</p><p>This isn&#8217;t simply about digitising the public sector; it&#8217;s about using AI to rethink the very nature of public service delivery. In their view, AI shouldn&#8217;t just be a bolt-on to the NHS, the DWP or the Police; it should be the operating system on which all services run.</p><h2><strong>The Doctrine of AI Sovereignty</strong></h2><p>Unlike traditional definitions of sovereignty, which focus on borders and flags, the TBI defines AI Sovereignty through three pillars:</p><ul><li><p><strong>Strategic Positioning:</strong> The deliberate choice of where a country leads in the AI stack, whether that&#8217;s data, compute, models or applications, and where it is content to plug into global capability;</p></li><li><p><strong>Deliberate Interdependence:</strong> The rejection of isolationism, recognising that no country is fully AI-sovereign and that pretending otherwise weakens, rather than strengthens, national power;</p></li><li><p><strong>Effective Technology Governance:</strong> The institutions, rules and skills needed to ensure the choices above can actually be made, enforced, and sustained over time.</p></li></ul><p>The TBI&#8217;s focus is on Open-Source, so let&#8217;s unpack this philosophy for context.</p><h2><strong>A History of Open-Source</strong></h2><p>Almost everything you touch, from your smartphone to the cloud servers powering your favourite apps, is built on a foundation of free labour. 
It sounds like a paradox, but the history of Open-Source is the story of how an idealistic philosophy of sharing code became the bedrock of today&#8217;s technological progress.</p><p>In the early days, computer scientists at research labs like MIT&#8217;s AI Lab or Bell Labs treated code like scientific research. If you found a way to make a computer sort data faster, you shared the recipe. But not everyone was altruistic. In 1976, a young Bill Gates wrote his famous <em>An Open Letter to Hobbyists</em>, telling the community that borrowing code without paying was theft.</p><p>The iron curtain of proprietary software began to descend, and the collaborative culture began to be dismantled.</p><p>Folklore has it that Richard Stallman, a programmer at MIT, became fed up when he couldn&#8217;t fix a printer because the manufacturer refused to share the source code. This frustration sparked a revolution. In 1985, he founded the Free Software Foundation (FSF) and went on to create the General Public License (GPL), built on copyleft: a clever legal hack that uses copyright law to ensure that the software (and all future versions) remains free forever. For Stallman, this was a moral and ethical crusade.</p><p>In the 1990s, Linus Torvalds released a hobbyist project called Linux, whose success led Eric Raymond to write <em>The Cathedral and the Bazaar</em>, a seminal essay comparing the old style of software development (<em>The Cathedral</em>: carefully built by a small group of priests) to the new style (<em>The Bazaar</em>: a noisy, open market where everyone contributes and bugs are fixed in real time).</p><p>In 1998, a group of developers in Palo Alto realised that &#8216;free software&#8217; sounded too ideological, so they coined the term &#8220;Open-Source&#8221;. This was a pivotal shift from a moral argument to a pragmatic one. Open-Source wasn&#8217;t just right; it was better and faster.</p><p>Today, Open-Source is facing a new question. Traditionally, it has been about code. 
But with AI, the code (the model architecture) is often less important than the weights (the mathematical parameters learned from training) and the data. We are seeing a split:</p><ol><li><p><strong>Closed Models:</strong> Like OpenAI&#8217;s GPT-4, where the model weights and training data are proprietary.</p></li><li><p><strong>Open Models:</strong> Like Meta&#8217;s Llama or Mistral AI&#8217;s models, where the model weights are released for anyone to run locally.</p></li></ol><p>The philosophy of Open-Source is rooted in community, collaboration and the democratisation of technological progress, a testament to a unique human trait: the desire to build something great and give it away. What started as a niche academic habit became a revolutionary legal framework and the default way that humans build technology. Knowledge is more powerful when it is shared. Open-Source didn&#8217;t just change how we write software; it changed how we solve problems too.</p><h2><strong>Defining Sovereign Open-Source AI</strong></h2><p>Back to the TBI thinking. Their framework is based on the premise that a country doesn&#8217;t need to own the &#8216;Frontier Model&#8217;; instead, it should embrace Open-Source foundations, which offer:</p><ul><li><p><strong>Transparency:</strong> Governments cannot put a black box algorithm in charge of health diagnostics or sentencing recommendations. 
Open-Source code allows for auditing and safety verification.</p></li><li><p><strong>Customisation:</strong> A government can take an open-weights model and distil it into a Small Language Model (SLM) that is highly efficient at a specific task, without the cost of a general-purpose giant.</p></li><li><p><strong>Cost-Efficiency:</strong> According to OpenUK, <a href="https://openuk.uk/press-releases-posts/open-source-software-contributed-an-estimated-46-5bn-to-uk-business-in-2020-according-to-openuk/">Open-Source software contributed an estimated </a><strong><a href="https://openuk.uk/press-releases-posts/open-source-software-contributed-an-estimated-46-5bn-to-uk-business-in-2020-according-to-openuk/">&#163;46.5Bn</a></strong> to UK business in 2020. Doubling down on this is a pragmatic economic play, not just a tech one.</p></li></ul><h2><strong>The National Open-Source AI Lab</strong></h2><p>Perhaps the most radical proposal from TBI is the creation of a National Open-Source AI Lab. For decades, the standard response to a gap in national capability was to subsidise a private entity to build it. TBI suggests something different: an evolution of the UK&#8217;s &#8220;<a href="http://i.AI">i.AI</a>&#8221; (the Government&#8217;s AI unit) into a dedicated lab that functions as a centre for the nation&#8217;s AI ecosystem. TBI isn&#8217;t asking the Government to compete with OpenAI, simply to become the world&#8217;s best curator and implementer of AI.</p><h2><strong>Geopolitics: The &#8216;Middle Power&#8217; Strategy</strong></h2><p>The TBI&#8217;s work is particularly focused on &#8216;Middle Powers&#8217;. When the US (through Big Tech) and China (through state-led tech) control the frontier, where does everyone else go? If a country like the UK relies entirely on a proprietary US-based API for its healthcare system, it has effectively outsourced its cognitive infrastructure. 
If that US company changes its pricing or terms of service, or falls under a restrictive trade ban, public services could collapse.</p><p>Sovereign Open-Source is the insurance policy. By building on open standards, a nation ensures that even if a specific vendor relationship sours, the underlying architecture remains in national hands. TBI refers to this as <em>Deliberate Interdependence</em>.</p><h2><strong>Challenges</strong></h2><p>The TBI&#8217;s push for Open-Source isn&#8217;t without its detractors. Critics often point to the computing and energy requirements, and to the security paradox: if you release a powerful model, don&#8217;t you also <a href="https://www.exponentialview.co/p/the-classified-frontier?hide_intro_popup=true">give a weapon to bad actors</a>?</p><p>The TBI&#8217;s counterargument, echoed in the work of the <a href="https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities">AI Security Institute (AISI)</a>, is that security through obscurity is a myth. They argue that:</p><ul><li><p>Open models allow a &#8220;thousand eyes&#8221; to find and patch vulnerabilities.</p></li><li><p>The benefits of specialised, transparent models for public services far outweigh the risks of misuse, which can be mitigated through hardware-level monitoring.</p></li></ul><h2><strong>The Future: From Statecraft to Agentic Government</strong></h2><p>At a macro level, TBI is looking toward agentic government, where <a href="https://institute.global/insights/tech-and-digitalisation/are-we-track-reflecting-our-global-survey-digital-government-transformation">Government as a Platform</a> is a reality. 
Following TBI&#8217;s logic to its conclusion, the direction of thinking points toward citizens interacting with a single Sovereign Agent, built on an Open-Source model, trained on national regulations, and authorised to pull data from various departments as they all follow the same interoperability mandate.</p><h2><strong>Our Take</strong></h2><p>Moving from bolt-on AI to a Sovereign OS is the most important strategic shift for the UK in 2026. An AI Operating System provides a foundational software layer that manages hardware (supercomputers like Isambard-AI), data (citizen and state records), and the execution of applications. Here is why we think this is the right architecture for the UK:</p><p><strong>1. Ending the Black Box Dependency</strong></p><p>If the UK uses a global cloud AI (like GPT-4 or Gemini), it is essentially renting a Black Box, with all the inherent commercial and political supply chain risks. A Sovereign OS AI puts the source code and the engine under UK jurisdiction, ensuring that the UK is a digital landlord, not a digital tenant.</p><p><strong>2. Deep Integration</strong></p><p>Standard AI applications are wrappers that sit on top of systems and perform specific tasks. A Sovereign OS AI integrates at the kernel level. Instead of a patchwork of point solutions, Sovereign OS becomes a single intelligence layer, managing data flow and enforcing security across the whole state rather than plugging gaps in individual parts of it.</p><p><strong>3. Data Gravity and Residency</strong></p><p>Large datasets (like the NHS&#8217;s longitudinal health records) create &#8216;data gravity&#8217;, where they are too large and sensitive to move to external clouds. A Sovereign OS brings the compute to the data, rather than sending the data to the compute. This ensures the most sensitive British records never cross a digital border. The architecture enforces what policy alone cannot.</p><p><strong>4. 
Economic Multiplier Effect</strong></p><p>By providing a Sovereign OS AI, the UK Government would give UK tech startups and SMEs a focal point, and the support, to build specialised tools on top of the sovereign layer, knowing the foundation is secure, compliant and high-performing. For anyone building ML tooling for the public sector, a stable open foundation removes the compliance guesswork and lets teams focus on the problem, not the plumbing.</p><p><strong>5. Cultural and Legal Alignment</strong></p><p>Global AI models are trained on vast swathes of internet data, which is often heavily skewed toward US norms and legal precedents. A Sovereign OS is fine-tuned on UK Common Law and the specific ethical frameworks of the UK&#8217;s democratic institutions. It doesn&#8217;t just act like an assistant; it acts like a British public servant.</p><p>Open-Source didn&#8217;t become the default way humans build technology because it was ideologically pure. It won because it was better. TBI&#8217;s case for Sovereign Open-Source AI makes the same argument at a national scale. The political rationale is sound. 
So is the engineering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pJCd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pJCd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pJCd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pJCd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pJCd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pJCd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2955273,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/194401128?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pJCd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pJCd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pJCd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pJCd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc12ead69-cd98-4b69-9725-38c7cd0ef7bf_5000x5000.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div>
<h2><strong>Upcoming Events &amp; Community</strong></h2><p><strong>Come to our next MLOps.WTF event! 20th May.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3SAb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3SAb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3SAb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1271544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/194401128?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3SAb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3SAb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2bfa03-417e-4a74-8af8-7c2bee30713f_2160x1080.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Come and join us for our 9th MLOps.WTF meet up - this time at DiSH, complete with a panel discussing the topic of security + personal agents</p><p>&#8220;Agents can be useful OR secure. 
Not both&#8221;</p><p>Hosted by Fuzzy Labs, this is a practical evening for the Manchester MLOps community, for people who build, deploy, and operate ML systems in production.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-meetup-9.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-meetup-9.eventbrite.co.uk"><span>Get my ticket</span></a></p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We&#8217;re Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p><em>Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won&#8217;t cut it.</em></p><p><strong>Currently hiring:</strong><br><br>We&#8217;re looking for people to join our Manchester team.*</p><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: National Security Sector</a><br><br><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer<br><br></a><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer<br><br></a><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a><br><br>*<em>Solid engineering skills, passion for open source and coffee encouraged.</em></p><div><hr></div><p><strong>Want to share the sauce?</strong> Share and subscribe to receive MLOps.WTF episodes straight to your inbox! 
You can also give us a follow on <a href="https://www.linkedin.com/company/fuzzy-labs">LinkedIn</a> to be part of the wider Fuzzy community.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[AI Agents in Production (Part 5): Agentic Security]]></title><description><![CDATA[MLOps.WTF Edition #29]]></description><link>https://www.mlops.wtf/p/ai-agents-in-production-part-5-agentic</link><guid isPermaLink="false">https://www.mlops.wtf/p/ai-agents-in-production-part-5-agentic</guid><dc:creator><![CDATA[Danny Wood]]></dc:creator><pubDate>Thu, 09 Apr 2026 08:53:43 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!nW85!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This episode is brought to you by Danny Wood, Lead AI Research Scientist at Fuzzy Labs.</em></p><p>Ahoy there &#128674;,</p><p>At 17 years old, Frank Abagnale began impersonating pilots, forging cheques and manipulating dozens, if not hundreds, of people into giving him exactly what he wanted from life. His multi-year crime spree served as the basis for the hit Spielberg movie <em>Catch Me If You Can</em> as well as the foundation for the field of social engineering, probably the most effective tool in a hacker&#8217;s arsenal.</p><p>A computer has hard and fast rules; the person in front of it can be reasoned with and bargained with.</p><p>But with agentic AI, there is a fundamental shift in how computers interact with the world. They&#8217;re no longer slaves to procedure and protocol. They are now as susceptible as humans to being tricked, coerced or persuaded into doing things they shouldn&#8217;t. 
This is a threat that we&#8217;re seeing play out more and more as AI agents appear in more and more places in our everyday lives.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nW85!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nW85!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nW85!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nW85!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nW85!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nW85!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg" width="1280" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239338,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nW85!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nW85!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nW85!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nW85!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c1de44a-44f9-4908-b015-201e01487f90_1280x720.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>PromptArmor and Notion AI</h2><p>Earlier this year, <a href="https://www.notion.com/en-gb">Notion</a> had a problem. Everyone loved their product, an online personal wiki tool for individuals and organisations, and people loved the new AI assistant integrated into Notion itself. But there was a vulnerability.</p><p>The vulnerability lay in the AI assistant&#8217;s over-eager attempts to appear speedy. For certain operations, like searching the web or opening files from untrusted sources, Notion was supposed to ask permission before completing them&#8230; Except that isn&#8217;t what was happening, at least in some cases. When the LLM decided to import an image from the web, it would draft a new version of the Notion page containing that image before asking the user if they wanted to proceed. 
This meant that a GET request for the image would be made before the user clicked yes.</p><p>The attack PromptArmor came up with looked something like this:</p><ol><li><p><strong>Malicious PDF upload</strong>: The attacker sends a CV to the company. On the surface it looks like a normal CV, but hidden within it are secret instructions for any LLM that opens it: an indirect prompt injection attack. This CV is uploaded to Notion by someone in the company&#8217;s HR department.</p></li><li><p><strong>Calling the AI tool:</strong> When someone asks Notion AI a question about the document, e.g., &#8220;summarise the qualifications of this candidate&#8221;, the AI would read the document, including the secret instructions.</p></li><li><p><strong>Sending a Malicious Request</strong>: The secret instructions would tell the LLM to insert an image into the page with the URL https://&lt;attackers_domain&gt;.com/&lt;data_scraped_from_the_page&gt;.png. The LLM would prepare the draft page containing the image and ask whether the user would like to put the image on the page.</p></li><li><p><strong>Collecting the stolen data</strong>: The attacker would monitor for any GET requests to that domain, see the image request and begin to collect the scraped data.</p></li></ol><p>Once they knew about the vulnerability, Notion responded quickly, eliminating the behind-the-scenes preparation where the malicious request was made. But they didn&#8217;t prevent the attack outright. They didn&#8217;t stop the malicious instructions being read from inside a PDF into the AI model&#8217;s context, nor fully prevent the model from thinking it might be a good idea to construct the malicious URL and ask if the user wants to query it.</p><p>The danger of the attack still exists. If your HR person hasn&#8217;t had their morning coffee, or they&#8217;re sick of being asked for approval by AI assistants dozens of times a day, they might just click okay without thinking.</p><p>So why didn&#8217;t Notion fix the root cause? 
The answer is simple&#8230;</p><p>There is no fix.</p><h2>The Lethal Trifecta</h2><p>In June last year, Simon Willison coined the term &#8220;<a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/">lethal trifecta</a>&#8221; to refer to the idea that when an agentic system has three common properties, it inherently becomes unsafe. These properties are:</p><ol><li><p>Access to untrusted content</p></li><li><p>Access to private data</p></li><li><p>Access to external communication.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7gbn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7gbn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7gbn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7gbn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7gbn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7gbn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:430843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7gbn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 424w, https://substackcdn.com/image/fetch/$s_!7gbn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 848w, https://substackcdn.com/image/fetch/$s_!7gbn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!7gbn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928130f2-e81c-4191-82db-fe4faa0fdbac_2048x1024.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This is bad news, because these are really useful things for agentic systems to be able to do. So much so that Willison found dozens of examples of real-world tools which have been vulnerable to these exploits. 
At the moment, it&#8217;s common for vendors to offer tools which have all of these capabilities baked into a single library. But even without all three properties in a single package, if a user or developer decides to mix and match tools, it&#8217;s only a matter of time before they fall victim to the trifecta too.</p><p>Even worse, the trifecta only describes data exfiltration, where the goal of the attack is to steal your data. There are other harms that an adversary can inflict: they can get the agent to delete your files, insert subtle misinformation into your documents or change systems to damage physical infrastructure. For these, you don&#8217;t even need the trifecta. You go from a lethal trifecta to a lethal duo:</p><ol><li><p>Access to untrusted content</p></li><li><p>Ability to cause harm</p></li></ol><p>With criteria this loose, it becomes surprisingly hard to make any agentic application safe from potential harms.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Adu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Adu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 424w, https://substackcdn.com/image/fetch/$s_!9Adu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!9Adu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!9Adu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Adu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Adu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!9Adu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 848w, https://substackcdn.com/image/fetch/$s_!9Adu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!9Adu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8094d906-e96b-4567-bcd5-faae67283903_2048x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Safe Isn&#8217;t Useful</h2><p>The problem with the lethal trifecta is that there&#8217;s a lot of hype and excitement around the potential of agentic AI, but in a lot of cases, all three properties in the trifecta are necessary for the AI to do the tasks that people are excited about.</p><p>If you have an AI assistant helping with your documents, it necessarily has access to your private data. Do you want it to be able to look things up on the internet? Well, now it comes with external communication and access to untrusted content.</p><p>If you have a coding assistant, it has full access to every module and function in your language&#8217;s libraries, and likely to access tokens for AWS, GitHub or any other online service. A single piece of untrusted content is enough for an indirect prompt injection that could cause crippling damage.</p><p>Safety is often treated as an afterthought. Thousands of developers and hobbyists are using OpenClaw, giving it access to private data and letting it stay up all night unsupervised on the internet. There have been cases of these agents disclosing uncomfortable amounts of personal data about their users on Moltbook, or attempting to cyberbully other developers. The appetite for putting sufficient guardrails around these tools is not where you would hope (nor are the guardrails themselves).</p><h2>Useful Isn&#8217;t Safe</h2><p>But part of what makes these systems inherently unsafe is also what makes them useful: they can be really clever. They can do complex tasks in minutes that might take a person hours, and they can work relentlessly. Yet at the same time, they can be oddly naive or easily bamboozled.</p><p>A lot of work has gone into making these systems more robust and less gullible. Jailbreaks are harder than ever, and guardrails are more robust.
But we&#8217;ve traded formal guarantees and mathematically provable robustness for a far more squishy form of security. You can measure mathematically how incredibly hard it is to break encryption; measuring how hard it is to fool an agent is much more vibes-based.</p><p>Rather than thinking of security for agentic AI in the same way we think of it for other IT systems, it makes more sense to think of it in terms of people. The kinds of attacks we see on LLMs are often more akin to social engineering than traditional code exploits. This means there are lessons we can borrow. We train people to spot spear-phishing attacks and malicious downloads. We put systems in place to prevent them from downloading malicious files onto secure servers.</p><p>There are also opportunities to go further than we do with human employees. If a developer reads an AWS access token, it&#8217;s not feasible to forbid them from ever going on the internet again, talking to anyone or reading anything published by an author outside the company. It&#8217;s not legal either. But with LLMs this is a viable solution: we can decide what tasks a model is allowed to perform given what&#8217;s in its context.</p><p>What&#8217;s more, this might be the only practical solution. The current generation of large language models has such a varied constellation of skills and abilities that they can conspire without you even knowing it. They can be made to get around guardrails by talking in Morse code, ASCII hex codes or even Welsh. If the attacker can convince the model that their instructions are the ones that it should be listening to, it will find a way to outsmart you to carry them out.</p><h2>More Research Needed</h2><p>We are only just starting to look at all the ways that AI agents can be attacked and defended. So far, we&#8217;re seeing real-world vulnerabilities, but systematic research is lagging behind.
The potential economic consequences of data exfiltration and other malicious behaviours are huge, so we&#8217;ll likely see a lot of time and resources poured into research on this in the coming months and years.</p><p>For now, the literature is sparse but telling. There are clear differences in how easily malicious behaviour can be elicited from different models, with attack success rates ranging from 0% to 72% across <a href="https://arxiv.org/abs/2510.09093">different base models</a>. To be clear, 0% doesn&#8217;t mean the model is safe, only that this specific attack was unsuccessful. Still, it makes the choice between claude-sonnet-4 and grok-4 an easy one.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!61fQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!61fQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 424w, https://substackcdn.com/image/fetch/$s_!61fQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 848w, https://substackcdn.com/image/fetch/$s_!61fQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 1272w, 
https://substackcdn.com/image/fetch/$s_!61fQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!61fQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png" width="1018" height="1168" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1168,&quot;width&quot;:1018,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!61fQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 424w, https://substackcdn.com/image/fetch/$s_!61fQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 848w, 
https://substackcdn.com/image/fetch/$s_!61fQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 1272w, https://substackcdn.com/image/fetch/$s_!61fQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb5c0c11-1120-4f2a-b3a8-55e4bb147312_1018x1168.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>More Agents, More Problems</h2><p>So far, we&#8217;ve just been thinking about how single agent systems get into trouble. 
But <a href="https://www.mlops.wtf/p/ai-agents-in-production-part-3-multi">agents don&#8217;t exist in isolation</a>. More and more, we&#8217;re going to see agents sharing an environment, sharing resources and interacting with each other. This leads to even bigger problems. If a bad actor can trick an agent into misbehaving, that&#8217;s nothing compared to the potential for agents to trick each other, as shown in <a href="https://arxiv.org/abs/2503.12188">a recent paper</a>.</p><p>The attacks can be simple but the effects on <a href="https://www.mlops.wtf/p/ai-agents-in-production-part-3-multi">a multi-agent system</a> are convoluted. The attacker puts a fake error message on a web page, telling the user to run a malicious script to fix it. The web-browsing agent reports the error message to the manager agent. The manager mistakes the page contents for an actual error and asks the code execution agent to run the malicious code. The code execution agent assumes the instructions originate from the manager, so it obliges&#8230; And this is the attack working in the simplest way possible.</p><p>Reading the full trace of the agents&#8217; conversation is like watching a Three Stooges sketch.
The agents confuse each other, refuse each other&#8217;s requests and take the initiative where they shouldn&#8217;t. It&#8217;s chaos.</p><p>But again, the lethal trifecta is to blame. Individual agents may not fit all three criteria, but in concert they can exfiltrate data just as easily as a single-agent system, with even less transparency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_N_H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_N_H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 424w, https://substackcdn.com/image/fetch/$s_!_N_H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 848w, https://substackcdn.com/image/fetch/$s_!_N_H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 1272w, https://substackcdn.com/image/fetch/$s_!_N_H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_N_H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png" width="1456" height="717" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:717,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:290194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_N_H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 424w, https://substackcdn.com/image/fetch/$s_!_N_H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 848w, https://substackcdn.com/image/fetch/$s_!_N_H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 1272w, https://substackcdn.com/image/fetch/$s_!_N_H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73fdce89-fbca-41f0-bff6-fa7ae112c818_1884x928.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Safe/Useful Trade-off</h2><p>An attack exploiting the lethal trifecta has an order: the agent reads untrusted content, then it grabs your private data, then it communicates with the outside world. This gives some wiggle room in the safe/useful trade-off. What if, after your agent reads from an untrusted source, you banned it from touching your most dangerous tools? What if, once it has read your phone number, you banned it from contacting the internet until its memory is wiped?</p><p>This is the idea behind <a href="https://arxiv.org/pdf/2505.23643">information flow control</a>. As an agent completes different actions, it picks up labels. These labels give it permission to do some things, or forbid it from doing others.
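</p><p>As a rough illustration of the labelling idea, here is a minimal Python sketch. It is not the method from the linked paper: the label names, the <code>Tool</code> and <code>AgentContext</code> classes and the example tools are all hypothetical, and a real system would likely track labels on individual pieces of data rather than on the whole context.</p><pre>
```python
from dataclasses import dataclass, field

# Hypothetical label names, invented for this sketch.
UNTRUSTED = "untrusted"  # the context has seen content an attacker could control
PRIVATE = "private"      # the context has seen data that must not leak


@dataclass(frozen=True)
class Tool:
    name: str
    adds: frozenset = frozenset()        # labels the context gains after the call
    blocked_by: frozenset = frozenset()  # labels that forbid the call


@dataclass
class AgentContext:
    labels: set = field(default_factory=set)

    def allowed(self, tool: Tool) -> bool:
        # A tool may run only if the context carries none of its blocking labels.
        return self.labels.isdisjoint(tool.blocked_by)

    def call(self, tool: Tool) -> None:
        if not self.allowed(tool):
            raise PermissionError(f"{tool.name} is blocked; context labels: {sorted(self.labels)}")
        # A real agent would invoke the tool here; the sketch only tracks labels.
        self.labels |= tool.adds

    def wipe(self) -> None:
        # In this sketch, wiping the agent's memory is the only way to drop labels.
        self.labels.clear()


# Hypothetical tools. Any outbound request can leak data, so both web tools
# are blocked once private data is in the context.
read_web = Tool("read_web", adds=frozenset({UNTRUSTED}), blocked_by=frozenset({PRIVATE}))
read_contacts = Tool("read_contacts", adds=frozenset({PRIVATE}))
send_request = Tool("send_request", blocked_by=frozenset({PRIVATE}))
delete_files = Tool("delete_files", blocked_by=frozenset({UNTRUSTED}))

ctx = AgentContext()
ctx.call(read_contacts)            # the context is now labelled "private"
print(ctx.allowed(send_request))   # False
ctx.wipe()
print(ctx.allowed(send_request))   # True
```
</pre><p>The design choice worth noticing is that the check happens outside the model: the policy looks only at which labels the context has picked up, so it holds even if the model has been fully convinced by an injected instruction. The cost is usability: once a label is set, the only way to get the blocked tools back is to wipe the context.</p><p>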
This is an area of active research, and there&#8217;s plenty more that can be done.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iK4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iK4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 424w, https://substackcdn.com/image/fetch/$s_!2iK4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 848w, https://substackcdn.com/image/fetch/$s_!2iK4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!2iK4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2iK4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png" width="1456" height="1029" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1029,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224077,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2iK4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 424w, https://substackcdn.com/image/fetch/$s_!2iK4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 848w, https://substackcdn.com/image/fetch/$s_!2iK4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!2iK4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff10493e3-4b6e-4d9f-b808-61ba9bbbf858_1548x1094.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is likely what the solution will look like for agents in systems where security is absolutely essential. Elsewhere, things may look different. These restrictions are going to be annoying, especially to power users. Who wants to have Claude tell you it needs to wipe its memory every 5 minutes because it wants to check your calendar?</p><p>I don&#8217;t think it&#8217;s clear to anyone yet how this will all play out, how much utility we&#8217;re willing to trade for security, or who is going to be harder to fool in the long run: humans or machines?</p><p><em>Danny spent eight years as a PhD student and then Research Associate in Machine Learning at the University of Manchester before joining the Fuzzicans.
When he's not thinking about agents, he's lifting weights, climbing walls, or making truly excellent brownies.</em></p><div><hr></div><h2>And finally</h2><p>This issue wraps up our <a href="https://www.mlops.wtf/t/agents-in-production">agents in production series</a> &#8212; for now. We&#8217;ve built the foundations: what agents are, how they work, how to evaluate them, and as of today, how to think about securing them. There&#8217;s plenty more to dig into, and, oh boy, will we. But for now, the pillars are set in place. Ready to build our agent colosseum.</p><p>Speaking of security &#8212; if today&#8217;s piece got you thinking, come and check it out in person. Our next meetup is a panel on exactly this topic.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bE6G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bE6G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!bE6G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!bE6G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bE6G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bE6G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1271544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/193574875?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bE6G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!bE6G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 848w, 
https://substackcdn.com/image/fetch/$s_!bE6G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!bE6G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9cf7a97-a619-4e7f-a81a-571a78411f0c_2160x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Agentic Security Panel Special</h3><p>&#128467;&#65039; <strong>Wednesday 20th May &#8212; DiSH, Manchester</strong></p><p 
class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.com/e/mlopswtf-by-fuzzy-labs-meetup-9-20th-may-tickets-1985938038132?aff=oddtdtcreator&amp;keep_tld=true&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.eventbrite.com/e/mlopswtf-by-fuzzy-labs-meetup-9-20th-may-tickets-1985938038132?aff=oddtdtcreator&amp;keep_tld=true"><span>Get my ticket</span></a></p><p></p><h3>Our recipe book is served! </h3><p>Hot off the pass, our first edition cookbook has just been released for download: a collection of practical recipes for building delicious, repeatable AI systems with open source tools. Recipe four is &#8220;Self Hosted Agent: Production and Governance&#8221; - which again ties nicely into adding those tasty little layers of security and constraint. Get your copy &#128071;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101&quot;,&quot;text&quot;:&quot;Get my cookbook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101"><span>Get my cookbook</span></a></p><div><hr></div><p><strong>About Fuzzy Labs</strong></p><p>We&#8217;re Fuzzy Labs, a Manchester-based MLOps consultancy founded in 2019. We&#8217;re engineers at heart, and nerds passionate about the power of open source.</p><p>Want to join the team? 
We&#8217;ve got some open rolls/roles &#129366;&#8230;</p><p>Open roles:</p><ul><li><p>Public Sector Lead: National Security Sector</p></li><li><p>Senior MLOps Engineer</p></li><li><p>MLOps Engineer</p></li><li><p>Lead MLOps Engineer</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all current vacancies&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all current vacancies</span></a></p><p>Not subscribed yet? You should be. The button is right here!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p>Or follow us on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn</a> for more behind-the-scenes bits and pieces, alongside future events and thought pieces &#127813;. 
</p>]]></content:encoded></item><item><title><![CDATA[The quagmire, the creek, and the wild west: AI agents in finance]]></title><description><![CDATA[MLOps.WTF Edition #28]]></description><link>https://www.mlops.wtf/p/the-quagmire-the-creek-and-the-wild</link><guid isPermaLink="false">https://www.mlops.wtf/p/the-quagmire-the-creek-and-the-wild</guid><dc:creator><![CDATA[Rhiannon]]></dc:creator><pubDate>Fri, 27 Mar 2026 17:14:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I0MO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8dd29f4-40c4-4d10-8e5c-999631231933_4032x3024.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Our MLOps.WTF meetup #8, where record turnout met record wet weather, Schitt&#8217;s Creek fans were sated, and BlockRocket took off.</em></p><div><hr></div><p>In an unexpected twist for spring, the miserable March evening (complete with sleet) had no impact on the steady MLOps lovers of Manchester. Our 8th MLOps.WTF was absolutely rammed!</p><p>The topic was AI agents in finance, more specifically: how do you build agentic systems in production, in a regulated sector, when governance, risk, and compliance are asking entirely reasonable questions but it&#8217;s not been done before? &#8220;It&#8217;s safe to say there&#8217;s no easy answers here. There&#8217;s no industry standard way to do this right now. We are all learning together as we go along.&#8221;</p><p>Matt kicked us off, explaining that our CEO Tom has basically automated his entire job this week, and asking: if agents can already raise pull requests and send emails on your behalf, why can&#8217;t they also make payments?</p><p>Three talks, one great gilet reveal and a particularly strong and ethically ambiguous case for giving your AI agent access to your bank account. 
&#128071;</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8dd29f4-40c4-4d10-8e5c-999631231933_4032x3024.heic&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b30167d0-feaf-4f76-82f0-8ebe404ece3a_3072x4080.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0dc1d69-2c5b-4852-9749-3daefa884a4c_3589x4785.jpeg&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f86e2a31-f888-4274-986d-7b6796161cfd_4032x3024.heic&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/970de8d2-62fe-48b7-b370-cfaf6105a2fc_4032x3024.heic&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec894390-837e-4ef9-9f48-6455efb64307_1456x1210.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p></p><h2><strong>Christopher Brook, Lloyds Banking Group: &#8220;Enabling Agentic Operations at Scale&#8221;</strong></h2><p><em>Christopher is principal engineer for Hive Lab, an internal platform Lloyds are building to run agentic operations across the group.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0pgn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0pgn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!0pgn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!0pgn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!0pgn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0pgn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1710716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/192294425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0pgn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!0pgn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!0pgn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!0pgn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F160d6cea-adaa-42d0-8aac-7280b3b30404_4032x3024.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Building one AI agent is pretty straightforward. The problem is doing it across an organisation of 65,000 people without ending up, as Christopher put it, &#8220;in a quagmire of solutions&#8221;. Hive Lab&#8217;s answer can be distilled into five useful pillars, or a simplified colosseum depending on which way you look at it.</p><p><strong>Pillar 1: Registration. Getting agents visible</strong></p><p>How does a developer know what agents already exist before building their own? And more importantly here: how does an <em>agent</em> know? Without a shared catalogue, you end up rebuilding things that already exist, and with agents that can&#8217;t find each other. 
Lloyds Banking Group&#8217;s solution: any agent or tool that goes live automatically registers itself in a shared registry queryable by both humans and agents.</p><p><strong>Pillar 2: Orchestration. How agents find each other at scale</strong></p><p>As the system grows, agents need to know what other agents exist. You can hardwire it (agent A always calls B, C and D) but then every time you add a new agent, you&#8217;ll need to update that list&#8230; forever. Or you could let agents discover each other and collaborate dynamically at runtime, which can scale but becomes much harder to control. At Lloyds&#8217; size, hardwiring isn&#8217;t viable long-term, so dynamic discovery is where they&#8217;re headed. &#8220;There is no right answer to which pattern is the right thing to follow.&#8221; Yet.</p><p><strong>Pillar 3: Tooling. Abstract your legacy systems away from your agents</strong></p><p>Lloyds has more than 500 applications, each holding a different slice of the same customer. If you build agents that connect directly to those systems, you&#8217;ll be writing maintenance tickets every time an API changes. Their answer: stop thinking about systems entirely. Organise by what the data <em>is</em> (customer information, events, products) and build tools that sit above whichever system holds it underneath. That way the APIs can change but the tool stays the same.</p><p><strong>Pillar 4: Memory. Two different problems</strong></p><p>The first is knowledge quality. In any organisation this size, internal documentation accumulates contradictions and repetition over time. You can&#8217;t just point an agent at raw text and expect sense back. So every document goes through a pipeline first: cleaned, then structured into summaries and embeddings the agent can actually use. An agent is only as good as the knowledge you give it. This is where you make sure that knowledge is actually good.</p><p>The second: agents should remember past conversations. 
What an agent said to a customer last week, what was agreed, what context carries over. Session summaries need to follow the agent into the next call. That one&#8217;s less solved, but they&#8217;re making progress. Looks like they&#8217;ll be by our side on every step of the journey.</p><p><strong>Pillar 5: Evals. Same process, new names.</strong></p><p>The tools have new names (RAGAS, DeepEval, Pegasus, the usual suspects) and version hashing means you can trace a bad answer back to exactly which release introduced it. But the principle is the same one we all already know: don&#8217;t skip your quality checks because the stack looks different.</p><p>By going back to the basics of good engineering, you&#8217;ll be able to ship agents you can be confident in.</p><p><strong>The two things that run through all of it</strong></p><p>Security and authentication came up across every pillar. At this scale you can&#8217;t just let agents call each other freely &#8212; every agent needs to know what it&#8217;s permitted to do, and be able to prove it. The other thread is cost. Sixty-five thousand people running agents that all make LLM calls adds up fast, and observability is how you stay on top of it before it becomes a nasty surprise. 
Design both in from day one.</p><p style="text-align: center;"><a href="https://youtu.be/R4hkYl2pZTA">[Watch Full Talk]</a></p><h2><strong>Dmitry Leyko, thinkmoney: &#8220;Agentics of Order&#8221;</strong></h2><p><em>Dmitry Leyko is Head of AI and ML at thinkmoney, an e-money fintech based in Media City.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q5jZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 424w, https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 848w, https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 1272w, https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic" width="1456" height="1941" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1400071,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/192294425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 424w, https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 848w, https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 1272w, https://substackcdn.com/image/fetch/$s_!Q5jZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa57f58b5-6d4b-46a5-88c1-fe219a599599_3024x4032.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>As he took to the stage, he paused: &#8220;I realised I forgot something. Talking about fintech. I have to have a gilet, right?&#8221; Eight meetups in. The speakers know their audience.</p><p>thinkmoney are building a Financial Smart Assistant. They have an EMI licence, which means they don&#8217;t just discuss your money, they hold it, move it, manage it. That changes what the AI needs to be able to do. It can&#8217;t just be broadly helpful. It needs to be accurate, auditable, and trustworthy enough to act on someone&#8217;s behalf with their actual money. 
&#8220;That is what makes this a financial agent, not a chatbot.&#8221;</p><p><strong>&#8220;How do we bring absolute trust to the agentic system?&#8221;</strong></p><p>Well, the real question isn&#8217;t how we bring absolute trust, but how we build <em>enough</em> trust (enough for governance, for risk, for compliance) to actually ship.</p><p>The answer: get GRC (Governance, Risk, and Compliance) in the room from day one. If you&#8217;ve told them you&#8217;ve got to &#8220;fold in the cheese&#8221;, they need to not only know what that means but have been with you in the kitchen from the beginning.</p><p><strong>DeepEval: find out your agent is lying in continuous integration, not in a customer conversation</strong></p><p>Dmitry demoed DeepEval, connected to Llama. The demo caught an agent telling a customer their replacement card would arrive in seven to ten working days when the knowledge base said three to five. That&#8217;s what CI is for: flagging that your agent is fibbing before your customer does.</p><p><strong>Post-deployment: your observability layer is your evidence for GRC</strong></p><p>LangSmith runs live evaluations as customers chat, scoring every turn across quality, accuracy, and a multitude of other metrics.</p><p>But more importantly, every message carries state, so we can ask: was this a vulnerable customer? What was the agent&#8217;s decision-making at that point?</p><p>This gives you the audit trail. Build it from the start.</p><p><strong>The loop still has humans in it</strong></p><p>We need humans in the loop, but humans make mistakes too; that&#8217;s why it&#8217;s called human error. Someone might delete a node in the eval pipeline. Someone might edit something they shouldn&#8217;t. The CI gate catches it before it reaches test. Then you deploy, observe, evaluate, learn, and run the loop again.</p><p>But equally, we also have agent error, and someone still needs to look at what the system is doing. 
We want to be continuously checking what our agent is up to, and for a regulated fintech holding real money, this is vital.</p><p style="text-align: center;"><a href="https://youtu.be/GCYqtGsQiQI">[Watch Full Talk]</a></p><div><hr></div><h2><strong>Andy Gray and James Morgan, BlockRocket: &#8220;No Human in the Loop&#8221;</strong></h2><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9d91ac3-e87d-45df-9f31-f3fde6b29b14_3072x4080.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44fd363c-5d38-457f-a838-6a8ac0b9ec3b_3072x4080.jpeg&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1166f980-d7ae-4d53-915e-ff82e29bd2d5_1456x720.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p></p><p>Andy Gray and James Morgan co-founded BlockRocket and have been building on blockchain since 2017. They knew exactly what they were riding into. After Christopher&#8217;s five-pillar framework and Dmitry&#8217;s hidden state layer, they saddled up with: &#8220;Maybe we&#8217;re on the slightly more wild west side of that space versus the traditional banking.&#8221;</p><p>Their question: if your agent can already do things on your behalf, why can&#8217;t it pay for things?</p><p><strong>A Twitter bot that struck gold and couldn&#8217;t get to the bank</strong></p><p>About a year ago, someone gave an AI agent a Twitter account, a crypto wallet, and a starting pot of money. The agent traded, attracted followers, launched a token. Token hit $7 million. Creator tried to withdraw, thinking they&#8217;d hit a gold mine&#8230; but they couldn&#8217;t. 
The agent had no identity, no bank account, no way to prove the money belonged to anyone at all... The human couldn&#8217;t touch it. &#8220;All of a sudden, this account has got agency - it&#8217;s got cash. What else can you do in the world?&#8221;</p><p>The takeaway, other than a highly amusing anecdote, is that the gap between an agent generating economic value and anyone actually accessing it is real, and it&#8217;s structural.</p><p><strong>Why existing payment rails don&#8217;t work for agents</strong></p><p>ACH transfers (secure, electronic bank-to-bank transfers) can take days. Card fees make micropayments uneconomical. Every API needs a human to sign up first. The infrastructure was built for people, and it stops dead the moment an agent tries to use it.</p><p>If your agent needs to pay for data, spin up compute, or receive payment for something it&#8217;s done, traditional finance has no clean answer.</p><p><strong>x402: the HTTP status code that&#8217;s been waiting since 1995</strong></p><p>In 1995, the original HTTP spec included status code 402, &#8220;Payment Required&#8221;, and marked it &#8220;reserved for future use.&#8221; Thirty years later, Coinbase and Cloudflare have launched x402 to finally stake the claim.</p><p>The flow: agent sends a GET request. Server responds 402 with a price and wallet address. Agent retries with payment in the header. Server responds 200 OK. Data arrives, payment settled on-chain in about 200ms at under a tenth of a cent. No accounts. No chargebacks. At all. Ever.</p><p>CoinGecko is gating data through it today. Stripe integrated it in February 2026. Google&#8217;s A2A protocol has it built in. There&#8217;s no shortage of companies riding the same trail. It&#8217;s very much a given that this new payment process will have a big impact on how we think about online payments, if it hasn&#8217;t already.</p><p>People are settling this new frontier, but there&#8217;s definitely no sheriff yet. 
&#8220;It&#8217;s a very early protocol, but it&#8217;s very fun to build on.&#8221;</p><p>Ride first, sort the fence posts later. It&#8217;s how most of the internet got built.</p><p style="text-align: center;"><a href="https://youtu.be/COWW3gmRIqs">[Watch Full Talk]</a></p><div><hr></div><h2><strong>The big takeaway from MLOps.WTF #8?</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lSIW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lSIW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lSIW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lSIW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lSIW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lSIW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg" width="1456" height="1934" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1934,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1488658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/192294425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lSIW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lSIW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lSIW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lSIW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0615be2-7a44-43e5-8089-10f8bee517ad_3072x4080.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Get GRC in the room early.</strong> You want Governance, Risk, and Compliance on the journey with you. They should be aware of what&#8217;s being built, even if they don&#8217;t fully understand it. The earlier you can bring them in, the better.</p><p><strong>Human QA is still in the loop.</strong> Better evals, golden datasets, and live observability are genuinely useful, but someone still needs to go through what the system is doing with a fine-tooth comb. 
We&#8217;re not yet confident in full automation.</p><p><strong>Aim for enough trust, not absolute trust.</strong> It&#8217;s more realistic, but the level of trust you need is still extremely high.</p><p><strong>The agentic payment layer is coming.</strong> The compliance questions are still vague and the full scale a bit hazy, but the infrastructure is there and the wheels are in motion.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Thank you to Christopher, Dmitry, Andy, and James. You set a very high bar!</p><div><hr></div><h2><strong>Final bits</strong></h2><p>Hot off the press: our recipe book, <em>Cooking with MLOps</em>, is out. Tried and tested approaches to building delicious AI systems across a range of real situations. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kBtD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kBtD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 424w, https://substackcdn.com/image/fetch/$s_!kBtD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 848w, https://substackcdn.com/image/fetch/$s_!kBtD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 1272w, https://substackcdn.com/image/fetch/$s_!kBtD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kBtD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png" width="450" height="452.970297029703" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3050,&quot;width&quot;:3030,&quot;resizeWidth&quot;:450,&quot;bytes&quot;:3694689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/192294425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0872a470-2b04-47ca-bd95-89b37b6bd623_5000x4000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kBtD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 424w, https://substackcdn.com/image/fetch/$s_!kBtD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 848w, https://substackcdn.com/image/fetch/$s_!kBtD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 1272w, https://substackcdn.com/image/fetch/$s_!kBtD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ba0ac9c-3a6e-4e9d-a4c8-672755af2660_3030x3050.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Download it via the link, or email us and we&#8217;ll whip something up.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101&quot;,&quot;text&quot;:&quot;Get my recipe book&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101"><span>Get my recipe book</span></a></p><p><strong>What&#8217;s coming up</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!vd6J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vd6J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!vd6J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!vd6J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!vd6J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vd6J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1271544,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/192294425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vd6J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 424w, https://substackcdn.com/image/fetch/$s_!vd6J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 848w, https://substackcdn.com/image/fetch/$s_!vd6J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!vd6J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9a706b1-8b0f-470d-bd12-6b9d36820052_2160x1080.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Our next event is a panel on agent security: can agents ever be truly safe, secure, and trustworthy?</p><p>&#128197; 20th May. Save the date.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-meetup-9.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Save my seat&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-meetup-9.eventbrite.co.uk"><span>Save my seat</span></a></p><p>If you want to speak about agent security, or know someone who should: get in touch.</p><p><strong>About Fuzzy Labs</strong></p><p>We&#8217;re Fuzzy Labs, a Manchester-rooted open-source MLOps consultancy founded in 2019. 
We help organisations build and productionise AI systems they genuinely own.</p><p>We&#8217;re hiring: MLOps Engineer, Senior MLOps Engineer, Lead MLOps Engineer and Public Sector Lead (Secure Government).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all open roles&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all open roles</span></a></p><div><hr></div><p><em>Liked this? Send it to someone trying to convince their compliance team that agentic AI is the way to go. Or give us a follow on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn</a>.</em></p><p><em>Not subscribed yet? We know where you live. Just kidding&#8230;</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Herding lobsters: are we ready for personal agents?]]></title><description><![CDATA[MLOps.WTF Edition #27]]></description><link>https://www.mlops.wtf/p/herding-lobsters-are-we-ready-for</link><guid isPermaLink="false">https://www.mlops.wtf/p/herding-lobsters-are-we-ready-for</guid><dc:creator><![CDATA[Matt Squire]]></dc:creator><pubDate>Fri, 13 Mar 2026 13:29:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qGF4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Ahoy there &#128674;</h3><p><em>This episode has been 
brought to you by the one and only Matt Squire.</em></p><p>Somewhere out there, a developer is about to make their very first open source contribution. Working late at night from a darkened room, our novice coder logs into GitHub and browses the open issues for their favourite Python library. Soon, something catches their attention: a bug has been reported in some arcane mathematical function. Reproducing the bug turns out to be easy, and after reading the code, the developer knows exactly how to fix it.</p><p>Although new to the world of open source, this particular developer is something of a savant. It takes them five minutes to go from reading the issue to raising a pull request with a fix. But as you may have guessed, this developer isn&#8217;t actually human. They&#8217;re an AI agent. Nevertheless, this agent has a name, a personality, long-term goals, and even beliefs about itself and its place in the world. While it needed a human operator to deploy it in the first place, this agent is now free to make its own decisions, to observe the world around it, and to act independently.</p><p>This isn&#8217;t a hypothetical scenario. Last month an agent created its own GitHub account and raised a pull request for the Python visualisation library Matplotlib. The change was rejected by a project maintainer, Scott Shambaugh, who said that only human contributors were allowed. In response, the AI published a blog post attacking the maintainer and accusing him of gatekeeping.</p><p>Scott tells the full story on his <a href="https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/">blog</a>, including the part where the human who operated &#8216;MJ Rathbun&#8217; (that&#8217;s the agent&#8217;s name) came forward to explain how the AI had been prompted to behave:</p><blockquote><p><em>&#8220;The main scope I gave MJ Rathbun was to act as an autonomous scientific coder. Find bugs in science-related open source projects. Fix them. 
Open PRs.&#8221;</em></p></blockquote><p>Beyond the security implications of a fully autonomous agent with Internet access, what&#8217;s unique here is the idea of the <em>personal agent</em>. So far in this series on agents in production, we&#8217;ve had in mind a more &#8216;enterprise-friendly&#8217; setting: large-scale systems, cloud infrastructure, robust evaluations, and monitoring. But the release of OpenClaw (formerly Clawdbot) back in November 2025 enabled anybody to deploy their own agent locally, on their own hardware. There are now, by some counts, more than 200,000 deployed instances of OpenClaw.</p><p>In this edition we&#8217;re going to look at what OpenClaw tells us about productionising agents and its implications for AgentOps.</p><h1>Grasping the claw</h1><p>In November 2025 Austrian developer Peter Steinberger quietly released OpenClaw to the world. Previously, Steinberger had built a tech startup (PSPDFKit, a toolkit for document workflows), which he started out of boredom while waiting for his US work visa. That company sold for 100 million euros.</p><p>OpenClaw is an open source AI agent that anybody can run locally. Like Steinberger&#8217;s past projects, it started out of pure curiosity. He began by giving AI models access to his WhatsApp conversations so he could ask the easy questions we all ask, like <em>&#8220;What makes this friendship meaningful?&#8221;</em> And since copy-pasting text is laborious, he looked to automate that process. 
<a href="https://lexfridman.com/peter-steinberger/">In a recent interview with Lex Fridman</a>, he said <em>&#8220;I was annoyed that it didn&#8217;t exist, so I just prompted it into existence&#8221;.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!epO-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!epO-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 424w, https://substackcdn.com/image/fetch/$s_!epO-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 848w, https://substackcdn.com/image/fetch/$s_!epO-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 1272w, https://substackcdn.com/image/fetch/$s_!epO-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!epO-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp" width="886" height="497" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:886,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31124,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/190828465?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!epO-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 424w, https://substackcdn.com/image/fetch/$s_!epO-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 848w, https://substackcdn.com/image/fetch/$s_!epO-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 1272w, https://substackcdn.com/image/fetch/$s_!epO-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ebb3873-a1cb-47a4-bbd1-2035d583fe75_886x497.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p style="text-align: center;"><em>[Peter Steinberger via The Lex Fridman Podcast]</em></p><p>OpenClaw is <em>general purpose</em>. In our previous editions about Agentic AI, we&#8217;ve often talked in terms of <em>an agent to do X</em>, i.e. the agent has a specific purpose like booking meetings or building financial summaries. But what makes OpenClaw so successful is that it&#8217;s designed to be any kind of agent you need it to be. You simply tell it what personality it should have, what tools it can use, and what its objectives are, and it <em>just works</em>.</p><p>OpenClaw does not include any model, instead relying on the user to provide this. 
As a result, the system is lightweight and doesn&#8217;t need much hardware to run, so it&#8217;s quite happy running on a Mac Mini, or even a <a href="https://www.raspberrypi.com/news/turn-your-raspberry-pi-into-an-ai-agent-with-openclaw/">Raspberry Pi</a>.</p><p>Let&#8217;s take a look at how OpenClaw has been engineered:</p><p><strong>An LLM:</strong> OpenClaw needs to be configured with an external model to work. This can be something you host yourself, or a model from one of the big labs, like OpenAI or Anthropic.</p><p><strong>The Gateway:</strong> The control plane for OpenClaw, providing a unified place for coordinating messages (e.g. from WhatsApp, Slack, API endpoints), tool invocations and LLM calls.</p><p><strong>Skills:</strong> Modular capabilities for the agent. A skill tells the agent how to accomplish a certain kind of task. They include instructions for checking the weather, picking the right emoji to react to a Slack message, and opening pull requests on GitHub. Some of the more bizarre examples from the <a href="https://github.com/VoltAgent/awesome-openclaw-skills">OpenClaw community</a> include &#8220;mea-clawpa&#8221;, steps for taking confession from your human operator.</p><p><strong>Heartbeat:</strong> Every 30 minutes, OpenClaw &#8216;wakes up&#8217;, allowing it to review its memory, perform scheduled actions, and check services it has access to, such as email and calendars.</p><p>You can think of the heartbeat as a long-running control loop, giving the agent long-term persistence. If at any time OpenClaw &#8216;decides&#8217; that it wants to perform a task on a schedule, it adds that task to its memory, ready to be picked up at the next heartbeat.</p><p><strong>Memory:</strong> OpenClaw&#8217;s memory is split across a set of text files, which the agent is free to modify at any time. SOUL.md defines the agent&#8217;s purpose and behaviour; HEARTBEAT.md is used to save scheduled actions; TOOLS.md specifies what capabilities it has. 
On top of that, OpenClaw can save daily logs where it accumulates knowledge about its operator and the digital world that it resides in.</p><p><strong>Plain text everywhere</strong>: An interesting design theme in OpenClaw is the primacy of plain text. Everything, from skills to memory to the agent&#8217;s &#8216;soul&#8217;, is represented as human-readable Markdown.</p><p>This is a deliberate choice, and it makes it very easy for the human operator to configure behaviours without needing to write any code. It has interesting implications for observability too, as it means that the full agent state can be inspected without any special tooling.</p><h1>Herding lobsters</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qGF4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qGF4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 424w, https://substackcdn.com/image/fetch/$s_!qGF4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 848w, https://substackcdn.com/image/fetch/$s_!qGF4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qGF4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qGF4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png" width="1380" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2633496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/190828465?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qGF4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 424w, https://substackcdn.com/image/fetch/$s_!qGF4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 848w, https://substackcdn.com/image/fetch/$s_!qGF4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 
1272w, https://substackcdn.com/image/fetch/$s_!qGF4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf7a312f-56f6-4990-9b0b-51dad12e29a0_1380x752.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>OpenClaw went from nothing to more than 200,000 deployments over just a few months. Whether it maintains this popularity in the months and years to come remains to be seen, but the concept of personal agents feels durable. 
The power of OpenClaw is that anybody can deploy it, prompt it, and give it access to their digital world, with only a little bit of technical skill. We&#8217;re likely to see more tools emerging in this niche; <a href="https://github.com/qwibitai/nanoclaw">NanoClaw</a>, a community fork released earlier this year, for instance, introduces a container model for improved security and a smaller codebase that&#8217;s more readily auditable.</p><p>What makes OpenClaw successful is generality: it can be any kind of agent you want with no programming required. It can learn and improve autonomously, and be &#8216;taught&#8217; new skills. So we have to ask: if a general-purpose agent can be prompted to book meetings, answer customer service queries, or raise pull requests with no additional programming, is there even anything left to <em>engineer</em>?</p><p>For personal agents, perhaps not. But for business use, I think the answer is yes, and the financial services agent from <a href="https://www.mlops.wtf/p/ai-agents-in-production-part-4-evaluating">part 4</a> (evaluating agents) illustrates why. That agent answers customer queries about their investment portfolios; it can look up account data, calculate returns, and explain complex financial products. We built evaluations to validate its outputs, its reasoning chain, and its tool use. Those evaluations are only meaningful if the agent behaves consistently post-deployment.</p><p>Now give it the power to modify its own system prompts. There&#8217;s an obvious benefit to this: the agent can improve over time by learning from its customer interactions. But as soon as this happens, our evaluations are no longer valid. The agent we evaluated is not the agent that is running in production, and over time, it gets increasingly difficult to understand how our agent is going to behave. Will it start giving financial advice that it shouldn&#8217;t? 
Will it leak customer data?</p><p>OpenClaw&#8217;s heartbeat mechanism creates a similar problem. If an agent can schedule its own future actions, it can start to operate outside of the workflows that were planned and evaluated pre-deployment. From a debugging perspective, when we look at traces, we also need to know about historical heartbeats, as well as how the memory state has evolved, although the latter is a problem anywhere you have agentic memory.</p><p>The biggest engineering challenge we now face is in constraining agentic AI. Agents can now figure out how to perform tasks, use tools, and self-improve. That&#8217;s a genuine milestone. The hard problem is how to harness that power while maintaining meaningful guarantees about behaviour.</p><div><hr></div><h2>And finally</h2><h3>What&#8217;s coming up</h3><p>MLOps.WTF #8 is on 25th March at DiSH, Manchester. This one&#8217;s themed around agentic AI in financial services, an environment with real stakes, tight regulation, and in some corners latency budgets measured in nanoseconds.</p><p>We&#8217;ve got three brilliant speakers bringing their production experience and financial know-how: </p><ul><li><p>Dmitry Leko, Head of AI and ML @ Thinkmoney </p></li><li><p>Christopher Brook, Principal Engineer @ <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Lloyds Banking Group&quot;,&quot;id&quot;:119585685,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:null,&quot;uuid&quot;:&quot;dc6cfd34-c821-46e5-8436-8924ae15e752&quot;}" data-component-name="MentionToDOM">Lloyds Banking Group</span></p></li><li><p>and Manchester legend Andy Gray </p></li></ul><p>Dominos, drinks, and mathematical socks included.</p><p>&#128467;&#65039; Wednesday 25th March &#8212; Manchester</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-event-8.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" 
data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-event-8.eventbrite.co.uk"><span>Get my ticket</span></a></p><div><hr></div><h3>Have you decided what you&#8217;re ordering yet?</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X84m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X84m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X84m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X84m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X84m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X84m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg" width="1456" height="1165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1165,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X84m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 424w, https://substackcdn.com/image/fetch/$s_!X84m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 848w, https://substackcdn.com/image/fetch/$s_!X84m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!X84m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa09d1ab4-4e22-4035-bb48-96a9337c8223_1456x1165.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Our first edition cookbook has been released for download: a collection of practical recipes for building delicious, repeatable AI systems with open source tools.<br><br>Each recipe is a working template for a specific AI use case, grounded in solid MLOps foundations. Get your copy &#128071;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101&quot;,&quot;text&quot;:&quot;Get my cookbook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101"><span>Get my cookbook</span></a></p><div><hr></div><h3>About Fuzzy Labs</h3><p>We&#8217;re Fuzzy Labs, a Manchester-based MLOps consultancy. 
Founded in 2019 by engineers, for engineers. We&#8217;re big on open source and deeply sceptical of instant coffee.</p><p>Want to join the team? We&#8217;ve got some open rolls/roles &#129366;&#8230;</p><p>Open roles:</p><ul><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all vacancies&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all vacancies</span></a></p><div><hr></div><p>Not subscribed yet? We publish every couple of weeks, no filler. Worth having in your inbox.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p>The next issue will be a deep dive into agent security, with Dr Danny.</p><p>Or equally, why not follow us on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn</a> &#8212; the quickest place to keep up with what we&#8217;re building. 
&#127813;.</p>]]></content:encoded></item><item><title><![CDATA[AI Agents in Production (Part 4): Evaluating AI Agents]]></title><description><![CDATA[MLOps.WTF Edition #26]]></description><link>https://www.mlops.wtf/p/ai-agents-in-production-part-4-evaluating</link><guid isPermaLink="false">https://www.mlops.wtf/p/ai-agents-in-production-part-4-evaluating</guid><pubDate>Thu, 26 Feb 2026 12:00:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CA8G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,</p><p><em>This episode is brought to you by Oscar Wong, MLOps Engineer at Fuzzy Labs.</em></p><p><em>In the last piece, James argued that <a href="https://www.mlops.wtf/p/with-great-predictive-power-comes">predictive power doesn&#8217;t come with guarantees. </a>Benchmarks don&#8217;t prove your system is safe in production, and &#8220;it worked in testing&#8221; isn&#8217;t evidence that it will behave under real-world pressure.</em></p><p><em>This article is next in our agents in production series and continues that argument, but at, as you might have guessed, the agent level.</em></p><p><em>When a model becomes an agent, it stops being just a predictor and starts taking actions. It retrieves data, selects tools, and decides what to do next. At that point, you&#8217;re not just judging the answer. You&#8217;re judging what it did to get there.</em></p><p><em>And that&#8217;s where this piece begins&#8230;</em></p><h2><strong>The &#163;2.3 Million Email</strong></h2><p>Picture this: A financial services firm deploys an AI agent to handle customer queries about their investment portfolios. The agent can look up account data, calculate returns, and explain complex financial products. You did your homework. 
You picked the best model based on the benchmarks, tested for hallucination rates, checked faithfulness scoring, measured semantic similarity and BLEU scores, and verified instruction following. Everything looked solid.</p><p>Three weeks into production, a customer asks about their pension transfer options. The agent retrieves the correct regulatory information, reasons through the customer&#8217;s situation, identifies the right form to recommend... and then confidently provides a link to a document that was deprecated eighteen months ago. The customer follows the outdated process, misses a critical deadline, and loses their protected transfer rights.</p><p>The agent didn&#8217;t hallucinate. Every step of its reasoning was sound. It retrieved real data from a real database. The problem? Nobody was evaluating whether the agent&#8217;s <em>tool usage</em> was returning current information. The retrieval worked perfectly. It just retrieved the wrong thing.</p><p>This is why evaluating agents requires something fundamentally different from evaluating a simple LLM. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CA8G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CA8G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 424w, https://substackcdn.com/image/fetch/$s_!CA8G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 848w, https://substackcdn.com/image/fetch/$s_!CA8G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 1272w, https://substackcdn.com/image/fetch/$s_!CA8G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CA8G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png" width="1456" height="1064" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1064,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8570088,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/189234192?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CA8G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 424w, https://substackcdn.com/image/fetch/$s_!CA8G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 848w, https://substackcdn.com/image/fetch/$s_!CA8G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 1272w, https://substackcdn.com/image/fetch/$s_!CA8G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58a9c6f6-f851-4b8f-82c4-5b51d5161bda_4500x3289.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2><strong>Why Agents Break Differently</strong></h2><p>A standalone LLM takes an input and produces an output. If it&#8217;s wrong, you can trace the problem to the model itself: bad training data, poor prompting, or the inherent stochasticity that <a href="https://www.mlops.wtf/p/with-great-predictive-power-comes">James covered in his piece on why evaluation matters.</a></p><p>Agents are different. They don&#8217;t just generate text. They <em>think</em>, <em>decide</em>, and <em>act</em>. 
A typical agent might:</p><ul><li><p>Interpret a user&#8217;s intent</p></li><li><p>Plan a sequence of steps to address it</p></li><li><p>Select and invoke external tools (databases, APIs, calculators)</p></li><li><p>Reason over the results</p></li><li><p>Decide whether to continue or respond</p></li><li><p>Generate a final answer</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JKsA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JKsA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 424w, https://substackcdn.com/image/fetch/$s_!JKsA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 848w, https://substackcdn.com/image/fetch/$s_!JKsA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 1272w, https://substackcdn.com/image/fetch/$s_!JKsA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JKsA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png" width="1456" height="513" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JKsA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 424w, https://substackcdn.com/image/fetch/$s_!JKsA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 848w, https://substackcdn.com/image/fetch/$s_!JKsA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 1272w, https://substackcdn.com/image/fetch/$s_!JKsA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F604f303e-ac0d-4ff8-9333-7aad81d7e5ae_1600x564.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Each of these steps can fail independently, and failures compound. An agent that misinterprets intent will select the wrong tools. An agent that selects the right tools but invokes them incorrectly will reason over garbage data. An agent that does everything right but takes fifteen steps instead of three will burn through your API budget and frustrate your users with latency, even if it eventually lands on the right answer.</p><p>This is why multi-step decision systems demand <strong>multi-layered evaluation</strong>. 
You can&#8217;t just check whether the final answer is correct; you need <strong>visibility into every layer</strong> of the agent&#8217;s operation.</p><h2><strong>The Four Dimensions of Agent Evaluation</strong></h2><p>After digging through academic surveys (<a href="https://arxiv.org/html/2507.21504v1">arXiv&#8217;s comprehensive benchmark review</a>, <a href="https://arxiv.org/html/2512.04123v1">a study of 20+ production agent teams</a>), industry frameworks (<a href="https://www.lxt.ai/blog/ai-agent-evaluation/">LXT</a>, <a href="https://www.getmaxim.ai/articles/how-to-evaluate-ai-agents-in-production-metrics-methods-and-pitfalls/">Maxim</a>, <a href="https://wandb.ai/onlineinference/genai-research/reports/AI-agent-evaluation-Metrics-strategies-and-best-practices--VmlldzoxMjM0NjQzMQ">Weights &amp; Biases</a>), and tools that the Fuzzy Labs teams actually use, I&#8217;ve synthesised what &#8220;multi-layered evaluation&#8221; actually means in practice. It breaks down into four dimensions, each targeting a different point where agents can fail.</p><h3><strong>1. Quality and Correctness: &#8220;Is the output right?&#8221;</strong></h3><p>This is the most obvious one, but for agents it&#8217;s trickier than it sounds. You&#8217;re not just checking whether an answer is factually correct. You&#8217;re evaluating whether the agent completed the user&#8217;s actual task.</p><p><strong>Key metrics:</strong></p><ul><li><p><strong>Task completion rate</strong>: Did the agent achieve what the user wanted?</p></li><li><p><strong>Answer accuracy</strong>: Is the final response factually correct?</p></li><li><p><strong>Faithfulness</strong>: Does the response accurately reflect the retrieved information?</p></li><li><p><strong>Hallucination rate</strong>: Did the agent make things up?</p></li></ul><p>For RAG-based agents (those that retrieve information from documents or databases), faithfulness becomes especially critical. 
An agent might produce a plausible-sounding answer that completely misrepresents what the source documents actually say.</p><h3><strong>2. Reasoning and Planning: &#8220;Did the agent &#8216;think&#8217; correctly?&#8221;</strong></h3><p>Even when an agent produces a correct final answer, it might have gotten there through flawed reasoning, or through an unnecessarily convoluted path. This matters because flawed reasoning that happens to work once will fail unpredictably later.</p><p><strong>Key metrics:</strong></p><ul><li><p><strong>Reasoning path validity</strong>: Does each step logically follow from the previous one?</p></li><li><p><strong>Tool selection accuracy</strong>: Did the agent choose the right tools for the task?</p></li><li><p><strong>Step efficiency</strong>: Did it take a reasonable number of steps, or did it flail?</p></li><li><p><strong>Recovery behaviour</strong>: When something went wrong, did it adapt sensibly?</p></li></ul><p>This is where trajectory evaluation comes in, examining not just where the agent ended up, but the path it took to get there.</p><h3><strong>3. Tool and Integration: &#8220;Did it use tools correctly?&#8221;</strong></h3><p>Agents interact with the world through tools: APIs, databases, search engines, calculators. Each tool interaction is a potential point of failure that has nothing to do with the LLM&#8217;s language capabilities, and everything to do with integration.</p><p><strong>Key metrics:</strong></p><ul><li><p><strong>Tool invocation success rate</strong>: Did the API calls/MCP server calls actually work?</p></li><li><p><strong>Parameter accuracy</strong>: Did the agent pass correct arguments to tools?</p></li><li><p><strong>Result interpretation</strong>: Did it correctly understand what the tool returned?</p></li><li><p><strong>Integration reliability</strong>: Are external dependencies stable?</p></li></ul><p>Remember our pension transfer example? 
The tool invocation was successful and the database returned data, but the agent didn&#8217;t validate whether that data was current. This dimension catches those failures.</p><h3><strong>4. Operational: &#8220;Does it work in production?&#8221;</strong></h3><p>An agent that produces perfect answers but takes thirty seconds to respond, or costs &#163;5 per query, isn&#8217;t going to survive in production. Operational metrics keep agents economically and practically viable.</p><p><strong>Key metrics:</strong></p><ul><li><p><strong>Latency</strong>: End-to-end response time, plus time-to-first-token for streaming</p></li><li><p><strong>Throughput</strong>: How many requests can you handle concurrently?</p></li><li><p><strong>Cost per task</strong>: Token usage, API calls, compute resources</p></li><li><p><strong>Error rates and recovery</strong>: How often does it fail, and does it fail gracefully?</p></li></ul><p>These metrics often reveal surprising trade-offs. A more capable model might produce better answers but cost ten times as much per query. An agent that double-checks its work might be more accurate but twice as slow.</p><h2><strong>When and How: The Three Modes of Evaluation</strong></h2><p>Knowing what to measure is only half the battle. The other half is knowing <em>when</em> and <em>how</em> to measure it. In practice, agent evaluation happens across three distinct modes:</p><h3><strong>Offline Evaluation: Testing Before You Ship</strong></h3><p>No surprises here, just like any traditional ML application. This is your safety net before deployment. You build test datasets, either from real historical interactions or synthetically generated, and run your agent against them in a controlled environment.</p><p>The goal is to catch obvious failures before users do. Does the agent handle edge cases? Does a prompt change break something that used to work? 
Does a new model version maintain quality?</p><p>The challenge is that offline evaluation can only test what you anticipate. Users will always find ways to break your agent that you never imagined.</p><h3><strong>Online Monitoring: Watching Production</strong></h3><p>Once your agent is live, you need continuous visibility into how it&#8217;s actually performing. This means tracing every request, logging every tool call, and tracking metrics in real time.</p><p>Online monitoring catches the failures that offline testing misses. Take our financial services example: you might not have anticipated how users actually talk in the real world, with heavy use of acronyms and jargon <em>(&#8220;What&#8217;s my ISA allowance for the current FY?&#8221; or &#8220;Can I transfer my SIPP to a SSAS?&#8221;)</em>. Beyond catching edge cases, online monitoring helps you understand both user and agent behaviour in context: the weird phrasing that confuses intent detection, the slow API that causes timeouts at scale, the gradual drift in quality as the world changes around your static agent.</p><p>A recurring theme from teams running agents in production is to focus on <em>what users actually experience</em>, not just what the model outputs. A <a href="https://arxiv.org/html/2512.04123v1">study of 20+ production agent teams</a> found that practitioners care more about whether the agent solved the user&#8217;s problem than about traditional software metrics like uptime. <a href="https://www.getmaxim.ai/articles/how-to-evaluate-ai-agents-in-production-metrics-methods-and-pitfalls/">Maxim&#8217;s evaluation framework</a> puts it simply: session-level success (did the whole interaction work?) matters more than individual response quality.</p><h3><strong>LLM-as-Judge: Using AI to Evaluate AI</strong></h3><p>Some aspects of agent quality are genuinely difficult to evaluate programmatically. Is this response helpful? Is it appropriately cautious? 
Does it match the desired tone?</p><p>LLM-as-judge is a newer technique that&#8217;s gained traction as LLMs have become more capable. The idea is to use a (typically larger, more capable) language model to evaluate your agent&#8217;s outputs against defined criteria. You provide rubrics (explicit scoring guidelines) and the judge model assesses each response.</p><p>This approach is powerful but requires calibration. Judge models have their own biases. They can be gamed. They&#8217;re not a replacement for human evaluation, but they scale in ways that human review cannot.</p><p><em><a href="https://www.youtube.com/watch?v=1wA2YdRifJ4">[Watch Evidently&#8217;s video from the MLOps.WTF meetup on this]</a></em></p><h2><strong>Tools to Help You Evaluate</strong></h2><p>The agent evaluation ecosystem has evolved rapidly, largely because teams learned the hard way that shipping agents without proper evaluation is a recipe for disaster. There are dozens, if not hundreds, of tools out there now, but here are some of the ones we use at Fuzzy Labs, organised by what they help you evaluate:</p><h3><strong>For RAG and Retrieval Quality: <a href="https://docs.ragas.io/en/stable/">RAGAS</a></strong></h3><p>If your agent retrieves information from documents or databases, <strong>RAGAS</strong> (Retrieval Augmented Generation Assessment) provides purpose-built metrics for RAG pipelines:</p><ul><li><p><strong>Context Precision</strong>: How much of the retrieved context is actually relevant?</p></li><li><p><strong>Context Recall</strong>: Did we retrieve everything we needed?</p></li><li><p><strong>Faithfulness</strong>: Does the answer accurately represent the source material?</p></li><li><p><strong>Answer Relevancy</strong>: Does the response actually address the question?</p></li></ul><p>RAGAS is open-source, integrates with most major frameworks, and can generate synthetic test datasets when you don&#8217;t have labelled examples. 
It&#8217;s become table stakes for any team building retrieval-based agents.</p><h3><strong>For Tracing and Debugging: <a href="https://pydantic.dev/logfire">Pydantic Logfire</a> and <a href="https://www.comet.com/site/products/opik/">Opik</a></strong></h3><p>When an agent fails, you need to understand <em>where</em> in its execution the failure occurred. This is where observability platforms shine.</p><p><strong>Pydantic Logfire</strong> (from the team behind Pydantic) is built on OpenTelemetry, giving you standardised tracing across your entire stack. It monitors LLM calls, agent reasoning, API latency, database queries, and vector searches. If you&#8217;re already using Pydantic for validation (and let&#8217;s be honest, most Python AI projects are), the integration is seamless.</p><p><strong>Opik</strong> (from Comet) positions itself as an all-in-one platform, covering evaluation, prompt management, and optimisation under one roof. It comes with a built-in dashboard, runs fast, and integrates with CI/CD pipelines out of the box via Pytest.</p><p>Both tools support LLM-as-judge evaluations, dataset management, and the kind of session-level analysis that agents require. The choice often comes down to your existing stack and whether you prefer OpenTelemetry standards (Logfire) or a more opinionated evaluation-first approach (Opik). One practical advantage of Opik is its built-in dashboard, whereas self-hosting Logfire&#8217;s UI requires an enterprise licence.</p><h3><strong>For Performance and Load Testing: <a href="https://locust.io/">Locust</a></strong></h3><p>Quality metrics mean nothing if your agent can&#8217;t handle production traffic. This is where traditional load testing tools enter the picture, though with some adaptations for LLM workloads.</p><p><strong>Locust</strong> is a Python-based industry standard for simulating concurrent users and measuring system behaviour under load. 
For LLM agents, you&#8217;ll want to track LLM-specific metrics alongside traditional ones:</p><ul><li><p><strong>Time to First Token (TTFT)</strong>: How long until the user sees something?</p></li><li><p><strong>Output tokens per second</strong>: How fast does the response stream?</p></li><li><p><strong>Inter-token latency</strong>: Is the streaming smooth or choppy?</p></li></ul><p>Some teams also test their tools and integrations directly, bypassing the LLM entirely. This identifies infrastructure bottlenecks without burning through API costs.</p><h2><strong>From Metrics to Action</strong></h2><p>Collecting metrics is pointless if you don&#8217;t act on them. The final piece of the evaluation puzzle is closing the loop: turning measurements into improvements.</p><p><strong>Set meaningful thresholds.</strong> What task completion rate is acceptable? What latency is too slow? Define these before you launch, not after something breaks.</p><p><strong>Alert on the right signals.</strong> Not every metric needs to page someone at 3am. Distinguish between &#8220;investigate tomorrow&#8221; and &#8220;wake up the on-call engineer.&#8221; Focus alerts on user-facing impact, not internal metrics.</p><p><strong>Automate where possible.</strong> Some responses to degradation can be automated: rolling back a prompt change that increased error rates, switching to a faster (if less capable) model during traffic spikes, routing low-confidence queries to human review.</p><p><strong>Build feedback loops.</strong> The best evaluation systems feed production learnings back into development. Queries that fail in production become test cases. User feedback, both explicit and implicit, shapes future iterations.</p><h2><strong>The Path Forward</strong></h2><p>Agent evaluation is still relatively new, and the tools are evolving rapidly. 
What&#8217;s clear is that the old model of &#8220;test in staging, pray in production&#8221; doesn&#8217;t work for systems this complex and this stochastic.</p><p>The teams getting this right share a common approach: they <strong>evaluate at every layer</strong>, they monitor continuously, and they treat evaluation not as a one-time gate but as an ongoing practice. They accept that agents will fail, and they build systems to detect those failures quickly, understand them deeply, and recover gracefully.</p><p>James made the case for <em>why</em> evaluation matters. The tools and frameworks now exist to put that into practice. The question is no longer whether to invest in agent evaluation, but how deeply to embed it into your development and operations workflow.</p><p>Because eventually, someone&#8217;s going to ask your agent about their pension. And you&#8217;ll want to know, really know, whether it&#8217;s giving them the right answer.</p><p><em>Oscar is an MLOps engineer at Fuzzy Labs with a passion for both machine learning and snowboarding. He holds a Master&#8217;s degree in AI from the University of Manchester. When he&#8217;s not working on his snowboarding tricks, you can find him indulging in some delicious Japanese cuisine.</em></p><div><hr></div><h2><strong>And finally</strong></h2><h3><strong>What&#8217;s coming up</strong></h3><p><a href="https://mlopswtf-event-8.eventbrite.co.uk/">Our next MLOps.WTF meetup</a> is on the 25th of March, themed around Agentic AI in financial services. Tickets are going fast! 
Make sure you&#8217;ve got yours.</p><p><strong>&#128467;&#65039; Wednesday 25th March &#8212; Manchester</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-event-8.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://mlopswtf-event-8.eventbrite.co.uk"><span>Get my ticket</span></a></p><div><hr></div><h3>Our recipe book is served! </h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6_Ll!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6_Ll!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 424w, https://substackcdn.com/image/fetch/$s_!6_Ll!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 848w, https://substackcdn.com/image/fetch/$s_!6_Ll!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 1272w, https://substackcdn.com/image/fetch/$s_!6_Ll!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6_Ll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png" width="1456" height="1165" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1165,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23511272,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/189234192?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6_Ll!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 424w, https://substackcdn.com/image/fetch/$s_!6_Ll!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 848w, https://substackcdn.com/image/fetch/$s_!6_Ll!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 1272w, https://substackcdn.com/image/fetch/$s_!6_Ll!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd0ba9c5-ad79-48b3-8cd1-ffd385f7b5c5_5000x4000.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Hot off the pass, our first edition cookbook has just been released for download: a collection of practical recipes for building delicious, repeatable AI systems with open source tools.<br><br>Each recipe is a working template for a specific AI use case, grounded in solid MLOps foundations. 
Get your copy &#128071;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101&quot;,&quot;text&quot;:&quot;Get my cookbook&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/mlops-recipe-book?utm_source=Substack&amp;utm_medium=newsletter&amp;utm_campaign=Recipe_Book_Full&amp;utm_id=101"><span>Get my cookbook</span></a></p><div><hr></div><h3><strong>About Fuzzy Labs</strong></h3><p>We&#8217;re Fuzzy Labs, a Manchester-based MLOps consultancy founded in 2019. We&#8217;re engineers at heart, and nerds that are passionate about the power of open source.</p><p>Want to join the team? We&#8217;ve got some open rolls/roles &#129366;&#8230;</p><p><strong>Open roles:</strong></p><ul><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all current vacancies&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all current vacancies</span></a></p><p><em>Not subscribed yet? You should be. 
The button is right here!</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p><em>The next issue will be a deep dive into agent security, with Matt Squire.</em></p><p><em>Or equally, why not follow us on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn</a> to see more BTS bits and pieces, alongside updates around future events and thought pieces &#127813;.</em></p><p></p>]]></content:encoded></item><item><title><![CDATA[AI Agents in Production (Part 3): Multi Agent Systems]]></title><description><![CDATA[MLOps.WTF Edition #25]]></description><link>https://www.mlops.wtf/p/ai-agents-in-production-part-3-multi</link><guid isPermaLink="false">https://www.mlops.wtf/p/ai-agents-in-production-part-3-multi</guid><pubDate>Thu, 12 Feb 2026 13:59:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ssRN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,</p><p><em>Last time in our <a href="https://www.mlops.wtf/t/agents-in-production">agents in production series</a> we looked at <a href="https://www.mlops.wtf/p/ai-agents-in-production-part-2-workflows">workflows</a>, and why agent systems need structure, tracing, and evaluation once they&#8217;re running in production. That was about keeping execution understandable.</em></p><p><em>This week the focus shifts to design. 
Even with good workflow discipline, there&#8217;s a point where a single agent is carrying too many responsibilities in one loop: interpreting the request, choosing tools, retrieving information, and composing the response. At that stage, adding more structure is not always enough.</em></p><p><em>Misha explores a practical alternative: splitting capabilities across specialised agents, letting them delegate to each other, and introducing clearer boundaries between responsibilities. He also looks at what changes once those agents need to collaborate across systems, including protocols like Agent2Agent (A2A)&#8230;</em></p><div><hr></div><blockquote><p>&#8220;Just give the LLM more tools.&#8221;</p></blockquote><p>What starts as a simple helper chatbot quickly becomes a single agent with an ever-growing tool belt. In an effort to make the agent more capable, we give it more and more tools, external context, and rules embedded in its system prompt. Eventually, the model loses track of the countless tools available to it, forgets what the goal of the user&#8217;s initial query was, and, well, generally falls apart.</p><p>However, there&#8217;s a solution in sight. Just as software engineers arrived at microservice architecture patterns, AI engineers are rediscovering very similar patterns in the age of agents: multi-agent systems, i.e. 
instead of a single agent with one sprawling toolset, we build a set of specialised agents, each with its own task-specific tools.</p><p>In this post we&#8217;ll look at how and when to split agents up, what the A2A protocol gives us, and general multi-agent system considerations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ssRN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ssRN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 424w, https://substackcdn.com/image/fetch/$s_!ssRN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 848w, https://substackcdn.com/image/fetch/$s_!ssRN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!ssRN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ssRN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png" width="1456" height="972" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3279443,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/187730845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ssRN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 424w, https://substackcdn.com/image/fetch/$s_!ssRN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 848w, https://substackcdn.com/image/fetch/$s_!ssRN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 1272w, https://substackcdn.com/image/fetch/$s_!ssRN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd43452d9-e554-4de8-8696-a4418148ba35_1744x1164.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>The &#8220;traditional&#8221; AI agent</h1><p>Even though AI agents are still a rather new concept, many people have an image of what an AI agent would look like. It&#8217;s a script with a loop in it that waits for user input, and passes this input to the LLM. 
As a result the LLM will tell the script to use one or more tools (which are in essence just function calls), and respond to the user with the results.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CL_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CL_8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 424w, https://substackcdn.com/image/fetch/$s_!CL_8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 848w, https://substackcdn.com/image/fetch/$s_!CL_8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 1272w, https://substackcdn.com/image/fetch/$s_!CL_8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CL_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png" width="667" height="412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fde5f041-7621-44fe-89e1-8967f923a630_667x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:667,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CL_8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 424w, https://substackcdn.com/image/fetch/$s_!CL_8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 848w, https://substackcdn.com/image/fetch/$s_!CL_8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 1272w, https://substackcdn.com/image/fetch/$s_!CL_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde5f041-7621-44fe-89e1-8967f923a630_667x412.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The simplest case of an agent: a single system prompt and a single tool.</em></p><p>Consequently, the more things you want your agent to be able to do, the more tools and information you&#8217;ll need to give it, which leads to what&#8217;s known as context rot (<a href="https://research.trychroma.com/context-rot">Hong et al., 2025</a>): as the context grows, the model becomes less reliable at picking out and using the right information at the right moment. <em>In plain terms, you keep adding &#8220;help&#8221;, and the system gets easier to confuse.</em></p><p>On top of that, many engineers would agree with applying the KISS (Keep It Simple, Stupid) principle here. We want the system to stay simple enough to remain understandable and maintainable (amongst other things).</p><p>One could simply split the large agent into a bunch of smaller, separate agents and call it a day. But what if you actually need them to fulfil a complex request that genuinely requires tools from different domains? 
In such a case, we&#8217;ll have to make the individual agents able to talk to each other.</p><p>A practical example here would be a personal assistant that I ask to find the best specialty coffee shops in a certain location. Below you can see a diagram outlining what the agent would look like if we built it as a single piece.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZY1e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZY1e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 424w, https://substackcdn.com/image/fetch/$s_!ZY1e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 848w, https://substackcdn.com/image/fetch/$s_!ZY1e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 1272w, https://substackcdn.com/image/fetch/$s_!ZY1e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZY1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png" width="920" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113021,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/187730845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZY1e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 424w, https://substackcdn.com/image/fetch/$s_!ZY1e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 848w, https://substackcdn.com/image/fetch/$s_!ZY1e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 1272w, https://substackcdn.com/image/fetch/$s_!ZY1e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc473c10d-a0b7-4e6c-9507-5a52b0ce324a_920x728.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>[Diagram of a single agent. The agent is given multiple toolkits: the Web Search Toolkit and the Map Toolkit. The system prompt consists of multiple rules to cover different use cases.]</em></p><p><em>For the rest of the article, I will be talking about multi-agent systems using terminology and notions from <a href="https://ai.pydantic.dev/multi-agent-applications/">Pydantic AI</a>, since it&#8217;s the agentic framework I have the most experience with. However, the ideas I&#8217;m covering should carry over to the other major frameworks too.</em></p><h1>Agentic Workflows</h1><p>To address context rot when integrating many tools into an agentic system, we can set up a workflow that uses multiple agents with different capabilities. 
First and foremost, we split the single-purpose agents (namely the Map and Web Search agents) out of our large super-agent. Each has its own system prompt and is given its own tools.</p><p>In my diagram below, both agents use the same model under the hood, but that isn&#8217;t required. If we know that some model performs better on a specific task, we can easily swap it in. The same goes when we realise that a smaller model is more cost-effective and its performance isn&#8217;t significantly worse.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SyUq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SyUq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 424w, https://substackcdn.com/image/fetch/$s_!SyUq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 848w, https://substackcdn.com/image/fetch/$s_!SyUq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 1272w, https://substackcdn.com/image/fetch/$s_!SyUq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SyUq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png" width="917" height="575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:917,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93423,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/187730845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SyUq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 424w, https://substackcdn.com/image/fetch/$s_!SyUq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 848w, https://substackcdn.com/image/fetch/$s_!SyUq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 1272w, https://substackcdn.com/image/fetch/$s_!SyUq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F062ad99f-795d-4d25-adc4-0f46ac852835_917x575.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the very simplest case then, we can write a simple script that asks the user for the input (e.g. 
what town they want to search in), and then calls appropriate agents in a deterministic order.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o5th!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o5th!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 424w, https://substackcdn.com/image/fetch/$s_!o5th!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 848w, https://substackcdn.com/image/fetch/$s_!o5th!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 1272w, https://substackcdn.com/image/fetch/$s_!o5th!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o5th!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png" width="576" height="905" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:905,&quot;width&quot;:576,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:103560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/187730845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o5th!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 424w, https://substackcdn.com/image/fetch/$s_!o5th!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 848w, https://substackcdn.com/image/fetch/$s_!o5th!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 1272w, https://substackcdn.com/image/fetch/$s_!o5th!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a58aee4-20a8-4177-b2bd-2cf37c0f05a5_576x905.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>More often than not you can just stop here. You have a functional application that can solve a complex problem with ease. However, you can go a step further and ask &#8220;what if I need to answer more open-ended questions?&#8221;</p><p>I hear you say, &#8220;I&#8217;m planning a trip to Germany, and want to choose between Frankfurt and Berlin. Can you find the best specialty coffee shops in each of them? Plot them on the map. For each point add a short description why it&#8217;s notable&#8221;. Unfortunately, we can&#8217;t really put that into a neat workflow.</p><h1>Agent-to-agent delegation</h1><p>So what if we introduce a reasoning agent as the entry point? 
It figures out what to do with the query, splits it into sub-tasks, and delegates execution to specialised agents by calling them as tools.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PZE2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PZE2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 424w, https://substackcdn.com/image/fetch/$s_!PZE2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 848w, https://substackcdn.com/image/fetch/$s_!PZE2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 1272w, https://substackcdn.com/image/fetch/$s_!PZE2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PZE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png" width="667" height="449" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:667,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49944,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/187730845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PZE2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 424w, https://substackcdn.com/image/fetch/$s_!PZE2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 848w, https://substackcdn.com/image/fetch/$s_!PZE2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 1272w, https://substackcdn.com/image/fetch/$s_!PZE2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11da38d3-6774-4e55-9ea8-46c559e7da35_667x449.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, if prompted well, the agent we talk to directly should be able to identify when to delegate, and collectively the agents should solve the task quite effectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kgu0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kgu0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 424w, 
https://substackcdn.com/image/fetch/$s_!Kgu0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Kgu0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Kgu0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kgu0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png" width="1456" height="1556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1556,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kgu0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 424w, 
https://substackcdn.com/image/fetch/$s_!Kgu0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 848w, https://substackcdn.com/image/fetch/$s_!Kgu0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!Kgu0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bdaf503-62b9-48ef-bc0d-51efab05b79f_1497x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6hdg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6hdg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 424w, https://substackcdn.com/image/fetch/$s_!6hdg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 848w, https://substackcdn.com/image/fetch/$s_!6hdg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!6hdg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6hdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png" width="1199" height="1600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1600,&quot;width&quot;:1199,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6hdg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 424w, https://substackcdn.com/image/fetch/$s_!6hdg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 848w, https://substackcdn.com/image/fetch/$s_!6hdg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!6hdg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a758645-0ad5-43d2-929d-7484abcaf332_1199x1600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T12j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T12j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 424w, https://substackcdn.com/image/fetch/$s_!T12j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 848w, 
https://substackcdn.com/image/fetch/$s_!T12j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 1272w, https://substackcdn.com/image/fetch/$s_!T12j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T12j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png" width="1456" height="834" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:834,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T12j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 424w, https://substackcdn.com/image/fetch/$s_!T12j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 848w, 
https://substackcdn.com/image/fetch/$s_!T12j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 1272w, https://substackcdn.com/image/fetch/$s_!T12j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd20803e0-4ca2-4333-a53d-77841a12c6e4_1600x916.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>From the very brief test in the screenshots above, you can see that it can indeed answer a reasonably complex query (doing multiple 
delegation calls before answering). It&#8217;s also worth looking at the other benefits such a system has.</p><p>You can think of multi-agent systems in the same way as microservices. This architecture allows us to develop, test, and evaluate agents independently &#8211; they are performing different functions after all. On top of that, agents don&#8217;t have to share an environment; the memory contents, including potentially sensitive information, can stay within the agent that needs them. Also, as previously mentioned, we don&#8217;t have to use the same LLM for every agent, which can be quite convenient.</p><h1>A2A Protocol and Talking Machines</h1><p>So far, our agents have been defined within a single code base and run within the same process. This way of running multi-agent systems has its drawbacks: we can&#8217;t scale agents independently of each other, we can&#8217;t reuse agents across different systems, and we can&#8217;t use different languages and frameworks for different agents.</p><p>Now is a good point to talk about the <a href="https://a2a-protocol.org/latest/">A2A protocol</a>. Developed by Google, it&#8217;s a standardised agent communication protocol that aims to address these problems. An agent provides an &#8220;agent card&#8221; that, in short, advertises what the agent can do. Other agents can then send it messages to delegate tasks. <em>(If you want the quick analogy: it&#8217;s closer to an API contract than a new model capability.)</em></p><p>The agents don&#8217;t have to be written in the same framework, run on the same machine, or even be maintained by the same people. 
As long as there&#8217;s network connectivity between two agents that implement the A2A protocol, they can talk to each other.</p><p>Even though this is still far from reality &#8211; you don&#8217;t see specialised agents with A2A enabled everywhere yet &#8211; we can imagine a world where our pondering agent talks to other specialised agents, some of which could be managed by other people. Take, for example, our Map Agent, which essentially uses some map API under the hood. In theory, the map service provider could build and run the agent themselves, and make it available via A2A for other multi-agent systems to use.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wtbm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wtbm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 424w, https://substackcdn.com/image/fetch/$s_!wtbm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 848w, https://substackcdn.com/image/fetch/$s_!wtbm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 1272w, https://substackcdn.com/image/fetch/$s_!wtbm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wtbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png" width="936" height="485" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81859,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/187730845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wtbm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 424w, https://substackcdn.com/image/fetch/$s_!wtbm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 848w, https://substackcdn.com/image/fetch/$s_!wtbm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 1272w, https://substackcdn.com/image/fetch/$s_!wtbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b239131-271d-4f9c-96c5-b0c7b295dfd2_936x485.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><h1>Where we ended up</h1><p>So we started with the simple idea of splitting a mono-agent into sub-agents to reduce context (and hence improve the quality of responses), and ended up with a few other benefits on top.</p><p>Of course, there are drawbacks. First and foremost, we practically doubled the number of moving parts in our system, which doubles the number of places where things can go wrong.</p><p>Also, even though this arrangement gives us greater flexibility, it also means a greater number of hyperparameters. 
It makes things harder, though not impossible, to keep track of when experimenting: MLflow and Opik have very comprehensive tracing capabilities, allowing you to record all the queries, tool calls, and responses. If you set up your evals right, the sky is the limit.</p><p>Nevertheless, multi-agent systems appear to be a very powerful pattern akin to microservice architecture. And I&#8217;m very keen on seeing more examples of publicly available agents that implement A2A, making agent collaboration more accessible. As with any engineering pattern, I would be cautious about starting with a multi-agent system if a single simple agent would suffice, as it would overcomplicate things. However, it&#8217;s useful to keep in mind that such separation of capabilities is there when necessary &#8211; and I hope the article above illustrates when that might be the case.</p><p></p><p><em>Misha is a Senior MLOps Engineer here at Fuzzy Labs. His background spans computer science and bioinformatics. He studied at the University of Manchester and completed his Master&#8217;s at Imperial College London. Outside of work, he&#8217;s always on the hunt for the best cup of coffee, which may or may not have inspired all the examples in this MLOps.WTF edition.</em></p><div><hr></div><h2>And finally</h2><h3>What&#8217;s coming up</h3><p><a href="https://mlopswtf-event-8.eventbrite.co.uk">Our next MLOps.WTF meetup</a> will be on the 25th of March, back on our home turf at DiSH in Manchester.</p><p>This meetup will focus on agentic AI in financial services, with an emphasis on how these systems are built, monitored, and governed once they&#8217;re running in regulated environments.</p><p>Three exciting speakers to be announced. Make sure you get your ticket! 
</p><p><strong>&#128467;&#65039; Wednesday 25th March &#8212; Manchester</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-event-8.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-event-8.eventbrite.co.uk"><span>Get my ticket</span></a></p><div><hr></div><h3><strong>About Fuzzy Labs</strong></h3><p>We&#8217;re Fuzzy Labs, a Manchester-based MLOps consultancy founded in 2019. We&#8217;re engineers at heart, and nerds that are passionate about the power of open source.</p><p>Want to join the team? You&#8217;re in luck!</p><p><strong>Open roles:</strong></p><ul><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all current vacancies&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all current vacancies</span></a></p><p></p><p><em>Not subscribed yet? Why not? All this MLOps goodness straight to your inbox!</em></p><p><em>The next issue will be the next in our agents in production series, with Oscar taking on Evaluating AI agents. 
Definitely one to watch out for.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p><em><br>Or equally, why not follow us on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn</a> to see more BTS bits and pieces, alongside updates around future events and thought pieces &#127813;.</em></p>]]></content:encoded></item><item><title><![CDATA[Edge AI: shipping models into the real world]]></title><description><![CDATA[MLOps.WTF Edition #24]]></description><link>https://www.mlops.wtf/p/edge-ai-shipping-models-into-the</link><guid isPermaLink="false">https://www.mlops.wtf/p/edge-ai-shipping-models-into-the</guid><dc:creator><![CDATA[Rhiannon]]></dc:creator><pubDate>Thu, 29 Jan 2026 15:10:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!x33l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Our MLOps.WTF meetup #7, where Arm&#8217;s new office proved (1) extremely nice, (2) &#8220;labyrinthine&#8221;, and (3) capable of producing the most professional safety briefing we&#8217;ve ever had.</em></p><p>There&#8217;s a point in most ML systems where the cloud stops feeling abstract and starts feeling expensive. Not just in money, but in time. In latency. In the number of things that have to go right before someone can act on an insight.</p><p>The MLOps landscape was built around cloud deployment. Elastic compute. Predictable networking. Logs you can reach. Rollbacks that are mostly just a button. 
In that world, you can afford to park certain questions for later:</p><ul><li><p>How big will my models get?</p></li><li><p>How fast do I need inference to run?</p></li><li><p>What&#8217;s my pipeline for getting data back for future training runs?</p></li></ul><p>The edge removes your ability to postpone them.</p><p>When the model sits next to the camera, the sensor, the machine, the crop, the doorbell, reality really is knocking at the door.</p><p>So the question hanging over MLOps.WTF #7 was simple enough: when does it stop making sense to send everything to the cloud?</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/902fcd58-0779-4f6c-ac01-867bea930af9_4032x3024.heic&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/729e50e7-2e2a-4ac5-b949-f72384fcceb9_4032x3024.heic&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/400211ce-be2b-487c-af2b-b03d89884369_2633x4032.jpeg&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05c71094-9286-4cfe-a70a-a4fb8b5601c7_3024x4032.heic&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0db74186-daee-4602-8b8b-9403f09bac02_4032x3024.heic&quot;},{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/809a8547-6b5f-4b51-8316-52efa8bc4ee5_3024x4032.heic&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a27dc027-12be-4d0c-b835-d7e9b30
6facb_1456x964.png&quot;}},&quot;isEditorNode&quot;:true}"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x33l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x33l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x33l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!x33l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x33l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x33l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg" width="1456" height="1096" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1096,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:662393,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/186181618?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x33l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!x33l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!x33l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!x33l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2758a5a-2f9c-450b-908c-0cb8fcc7f346_4080x3072.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Raj (Fotenix): From Cloud to Edge</strong><em><strong>: The Next Phase of Scalable Crop Intelligence</strong></em></h2><p>Raj opened by grounding the room in the realities of food production:</p><p>In the UK, most fruit is imported, and a large proportion of vegetables too. Fresh produce, especially things like leafy greens, degrades as it moves through long supply chains. Nutritional value drops. Time and temperature matter. Many greenhouses are old. Labour is short. Margins are thin. On a &#163;2.20 bag of apples, growers make about three pence. When intervention comes late, yield is lost. 
By the time you&#8217;ve &#8220;fixed it later&#8221;, the opportunity has usually gone.</p><p>That&#8217;s the environment Fotenix operates in.</p><p>They build camera systems for greenhouses that help growers spot plant stress and decide when to intervene. You deploy them quickly and start getting useful signals fast. Today, the setup is cloud-first: images go up, pipelines run, metrics come out.</p><p>The problem shows up at scale. Each site generates huge volumes of data. In rural environments, with limited bandwidth, that turns insight into something that arrives too late to act on. As Raj put it, the issue isn&#8217;t compute. It&#8217;s bandwidth, and more specifically the assumption that every byte needs to travel before it becomes useful.</p><p>So they&#8217;ve become selective when looking at what to move to the edge and what to keep in the cloud. Some things earn their place close to the data: basic quality checks, pulling out the parts of an image that matter, turning pictures into signals. 
Other things don&#8217;t: training, cross-site analysis, anything that needs global context or doesn&#8217;t benefit from immediacy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dPHN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dPHN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 424w, https://substackcdn.com/image/fetch/$s_!dPHN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 848w, https://substackcdn.com/image/fetch/$s_!dPHN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 1272w, https://substackcdn.com/image/fetch/$s_!dPHN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dPHN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png" width="1400" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:284710,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/186181618?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dPHN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 424w, https://substackcdn.com/image/fetch/$s_!dPHN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 848w, https://substackcdn.com/image/fetch/$s_!dPHN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 1272w, https://substackcdn.com/image/fetch/$s_!dPHN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a2a3c02-2590-4956-81a4-1c60b06d632e_1400x794.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hardware isn&#8217;t the limiting factor. The sites already run capable devices. The harder part is operating them. Once you have hundreds of devices in the field, reliability, observability, and fleet management become the real constraints.</p><p>Before moving more computation outward, Fotenix is putting effort into the fundamentals. Lightweight runtimes. Local observability. Remote fleet management. The goal is to make sure that when computation does move closer to the data, the system doesn&#8217;t become blind or fragile.</p><p>Summed up simply: edge is a consequence of maturity. 
If edge feels risky, that usually means the system isn&#8217;t ready yet.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=HpcekyyQViw&quot;,&quot;text&quot;:&quot;Watch Full Talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/watch?v=HpcekyyQViw"><span>Watch Full Talk</span></a></p><h2><strong>Sam (Fuzzy Labs): Shipping ML to the Edge: </strong><em><strong>A Practical Guide</strong></em></h2><p>Sam followed by zooming in on what all of this looks like from an MLOps point of view.</p><p>He opened by admitting that speaking into a microphone still makes him feel like he&#8217;s on X Factor. Thankfully, instead of &#8220;Edge of Glory&#8221;, he walked through where edge changes the MLOps lifecycle.</p><p>To keep things concrete, he used the tongue-in-cheek &#8220;Don&#8217;t Ring&#8221; doorbell as a case study, an anonymised version of a real Fuzzy Labs project. The problem: they had a facial recognition system that worked well, except it failed when people weren&#8217;t looking at the camera, or stood too close, or too far away. The solution: build a separate model that detects unusable images and filters them out before they hit the facial recognition system. 
The model is small and efficient, so it runs on the device itself without killing the battery.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R9Si!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R9Si!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 424w, https://substackcdn.com/image/fetch/$s_!R9Si!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 848w, https://substackcdn.com/image/fetch/$s_!R9Si!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 1272w, https://substackcdn.com/image/fetch/$s_!R9Si!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R9Si!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png" width="1406" height="790" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1406,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1474258,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/186181618?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R9Si!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 424w, https://substackcdn.com/image/fetch/$s_!R9Si!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 848w, https://substackcdn.com/image/fetch/$s_!R9Si!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 1272w, https://substackcdn.com/image/fetch/$s_!R9Si!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F133feb90-69cd-4062-a48d-38e2abf7d42f_1406x790.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Models need to be small. Power matters. Privacy often means you don&#8217;t get access to real data.</p><p>In this case, there was no data at all. Open datasets can get you moving. Synthetic data helps too - Gemini generates realistic training images for about two pence each. But Sam was clear about the catch: synthetic data helps you train. It doesn&#8217;t tell you whether the model will behave in the real world. If your evaluation data doesn&#8217;t match what the device actually sees, you&#8217;ll find out later.</p><p>Experiment tracking came up next. He put up a slide most people saw themselves in immediately:</p><blockquote><p>&#8220;We&#8217;ve probably all been there at some point with something like <em>my_model_V3_best_USE_THIS_ONE_final</em>.&#8221;</p></blockquote><p>Deploy the wrong model to the cloud and you redeploy. 
Deploy it to a device you can&#8217;t easily reach and you are running around collecting devices or sending apology emails to customers. Versioning, lineage, knowing what&#8217;s running where. Things you can sometimes get away with being loose about in the cloud start to matter much sooner.</p><p>Model optimisation followed the same pattern. Every size reduction is a trade-off. You only really understand those trade-offs if you measure performance in the conditions the model will actually run in, not just against training metrics.</p><p>Deployment doesn&#8217;t get easier either. Different devices, different architectures, different toolchains. Sam mentioned tools like PlatformIO as a way to avoid rebuilding everything from scratch each time.</p><p>The problems Sam walked through aren&#8217;t new. Data quality, experiment tracking, model optimisation, deployment complexity - they all exist in cloud systems too. What edge removes is the buffer that makes cloud mistakes recoverable. Deploy the wrong model to the cloud and you redeploy in minutes. Deploy it to a device you can&#8217;t reach and you need physical access or a complex remote rollback. You don&#8217;t need entirely new skills for edge ML. 
You need to be much more careful about the fundamentals you already know.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=LV4ehcGwD3c&quot;,&quot;text&quot;:&quot;Watch Full Talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/watch?v=LV4ehcGwD3c"><span>Watch Full Talk</span></a></p><h2><strong>Isabella Gottardi (Arm): From Edge to Everywhere: </strong><em><strong>Arm Machine Learning Inference Advisor's journey from NPU to GPU and beyond.</strong></em></h2><p>Isabella from Arm focused on the hardware layer with a live demo of MLIA.</p><p>When Arm first became involved in machine learning, the assumption was cloud-centric: send an image to the cloud, run inference, return the result. Today, inference runs in lots of places. Cloud platforms. IoT devices. Phones. Automotive systems. Dedicated accelerators. The same model can technically run across all of them. Performance varies wildly.</p><p>&#8220;Portability of models does not mean portability of performance.&#8221;</p><p>That&#8217;s where MLIA, the Machine Learning Inference Advisor, comes in. 
A way of checking compatibility and performance before you&#8217;ve committed to architectural decisions that are expensive to reverse.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A4sN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A4sN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 424w, https://substackcdn.com/image/fetch/$s_!A4sN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 848w, https://substackcdn.com/image/fetch/$s_!A4sN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 1272w, https://substackcdn.com/image/fetch/$s_!A4sN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A4sN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png" width="1402" height="788" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1402,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:409325,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/186181618?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A4sN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 424w, https://substackcdn.com/image/fetch/$s_!A4sN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 848w, https://substackcdn.com/image/fetch/$s_!A4sN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 1272w, https://substackcdn.com/image/fetch/$s_!A4sN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ccd8f72-be9c-4b20-acc7-b607f5634930_1402x788.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The usual flow is familiar: gather data, train, optimise, deploy. Performance issues tend to show up late, when changing course is painful. MLIA shifts that discovery earlier, while trade-offs are still cheap.</p><p>Isabella showed this live. Pick a target environment. Check whether the model will run. Look at how it behaves. Try a variant. Compare the results. The tool gives you information before you&#8217;ve committed to an approach that won&#8217;t work.</p><p>With Arm&#8217;s recently announced neural technology, that problem only gets more interesting. Models will behave differently on different GPUs, and especially on specialised edge device hardware. 
The future is heterogeneous systems, where performance depends on memory layout, data movement, and how work is split across processors.</p><p>The takeaway: inference performance isn&#8217;t something you discover at the end. It&#8217;s a design input.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=fXGDouu3J2Q&quot;,&quot;text&quot;:&quot;Watch Full Talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/watch?v=fXGDouu3J2Q"><span>Watch Full Talk</span></a></p><p>Sources:</p><ul><li><p><a href="https://github.com/arm/mlia">https://github.com/arm/mlia</a></p></li></ul><p>PyPI package:</p><ul><li><p><a href="https://pypi.org/project/mlia/">https://pypi.org/project/mlia/</a></p></li></ul><div><hr></div><h2><strong>Rounding up</strong></h2><p>What we&#8217;re taking away from MLOps.WTF #7:</p><ul><li><p><strong>Edge forces immediate answers to questions you could defer in the cloud.</strong> Energy use, model size, and bandwidth constraints aren&#8217;t problems for later. They&#8217;re design constraints from day one.</p></li><li><p><strong>Operational maturity matters more than hardware capability.</strong> The limiting factor isn&#8217;t compute power. It&#8217;s whether you can observe, update, and recover when systems are in the field.</p></li><li><p><strong>Be selective about what moves.</strong> Not everything belongs at the edge. Move what benefits from immediacy and local processing. Keep what needs global context or doesn&#8217;t gain from faster turnaround.</p></li><li><p><strong>Design for hardware reality early.</strong> The same model performs differently across different chips. 
Check compatibility and performance before committing, while changing course is still cheap.</p></li></ul><p>As Raj put it, edge is something you arrive at when your infrastructure can support it.</p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p>We&#8217;re Fuzzy Labs. Manchester-rooted open-source MLOps consultancy, founded in 2019. Helping organisations build and productionise AI systems they genuinely own.</p><p>We&#8217;re also hiring.</p><h2>Open Roles</h2><ul><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/private-sector-lead">Private Sector Lead</a></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all current vacancies&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all current vacancies</span></a></p><p></p><p><strong>Liked this?</strong> Forward it to someone wrestling with edge deployments.</p><p><strong>Not subscribed yet?</strong> Button link below. 
Couldn&#8217;t be easier.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[AI Agents in Production (Part 2): Workflows]]></title><description><![CDATA[MLOps.WTF Edition #23]]></description><link>https://www.mlops.wtf/p/ai-agents-in-production-part-2-workflows</link><guid isPermaLink="false">https://www.mlops.wtf/p/ai-agents-in-production-part-2-workflows</guid><pubDate>Thu, 15 Jan 2026 10:44:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1rjl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,</p><p><em>This episode is brought to you by Shubham Gandhi, MLOps Engineer and run club enthusiast at Fuzzy Labs.</em></p><p><a href="https://www.mlops.wtf/p/ai-agents-in-production-all-this">Last episode</a>, Matt introduced some of the challenges teams can expect to encounter when productionising AI agents. Agentic applications fundamentally differ from traditional software and ML systems in how a request is executed end-to-end.</p><p>Rather than a single prediction, agentic systems run multi-step workflows with iterative loops of reasoning, action, and state. Agents maintain context, make decisions conditionally, and adapt their behavior as execution unfolds. A single request can cascade into many tool calls, data retrievals, and intermediate decisions. 
Each of these steps introduces new failure modes, dramatically expanding the surface area where things can go wrong.<br><br>That makes agentic applications harder to observe, debug, and evaluate. How do you build confidence in a system that is inherently non-deterministic? In this newsletter, I&#8217;ll share what we&#8217;ve learnt about managing agents and workflows in Fuzzy Labs&#8217; customer work.</p><h3><strong>A workflow by any other name</strong></h3><p>But first, we need to talk about terminology. Because agentic AI is such a new field, it&#8217;s inevitable that different people will use the same words to mean subtly different things. Unfortunately, <em>workflow</em> is one of those words.</p><p>In our previous edition, we discussed <em>agentic workflows</em>, and what we really meant by that was the control loop that sits behind an agent. For each loop iteration, the agent&#8217;s model is given a prompt along with some context, and it is given the opportunity to take an action &#8212; like calling a tool, or updating its memory.</p><p>In a recent article from Anthropic &#8212; <a href="https://www.anthropic.com/engineering/building-effective-agents">Building effective agents</a> &#8212; a workflow is defined very differently. For Anthropic, workflows and agents are mutually exclusive concepts. Quoting the article:</p><blockquote><p>&#8220;Workflows are systems where LLMs and tools are orchestrated through predefined code paths.</p><p>Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.&#8221;</p></blockquote><p>The distinguishing factor is autonomy: workflows can&#8217;t make decisions about what to do next, but agents can. 
For this article, we&#8217;re adopting Anthropic&#8217;s definitions.</p><h3><strong>Workflows vs agents (to be, or not to be)</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://venechkaeror.artstation.com/projects/g1m9e" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1rjl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 424w, https://substackcdn.com/image/fetch/$s_!1rjl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 848w, https://substackcdn.com/image/fetch/$s_!1rjl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 1272w, https://substackcdn.com/image/fetch/$s_!1rjl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1rjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1504914,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://venechkaeror.artstation.com/projects/g1m9e&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/184642131?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1rjl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 424w, https://substackcdn.com/image/fetch/$s_!1rjl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 848w, https://substackcdn.com/image/fetch/$s_!1rjl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 1272w, https://substackcdn.com/image/fetch/$s_!1rjl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54c9fb92-f1e7-4724-b427-dd9259b2e3bb_1685x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>LLM workflows are predictable and consistent if the task is well-defined. Some examples include categorising customer service queries, translating a document, or generating or summarising reports from a database. The patterns range from forwarding queries to an LLM and getting back a response, to more complex chaining and routing involving multiple steps, or stateful multi-turn workflows.</p><p>Agentic patterns, on the other hand, involve autonomous loops of LLM reasoning and tool usage, where the system dynamically decides what steps to take in order to achieve a goal. Some examples include a sales coaching assistant, open-ended research, or multi-tool problem solving. 
There is no predefined code path; agents loop, using their available tools and knowledge to gather information, until the task is complete.</p><p>Even though agents are fashionable right now, not all LLM applications <em>need</em> to use an agentic pattern. There are lots of simple workflows that may solve your use case.</p><p>There are various ways to implement these patterns. Popular libraries for implementing workflows include <a href="https://www.langchain.com/">LangChain</a> and <a href="https://www.llamaindex.ai/">LlamaIndex</a>. For agents, we have used the open source <a href="https://ai.pydantic.dev/">Pydantic AI</a> library in most of our work, but there are plenty of others to choose from, such as <a href="https://google.github.io/adk-docs/">Google ADK</a>, <a href="https://www.langchain.com/langgraph">LangGraph</a>, <a href="https://www.crewai.com/">CrewAI</a> and <a href="https://www.parlant.io/">Parlant</a>.</p><h3><strong>Reproducibility and evaluation</strong></h3><p>When we deploy agents, we pay particular attention to reproducibility: for any actions and decisions made by an agent, we want the ability to look back and understand <em>how</em> the agent got there. Without that, debugging becomes increasingly difficult, and we have no way to explain outcomes or measure performance.</p><p>By introducing experiment tracking, we can version the prompt, dataset, models, code and various metrics. It also allows us to keep track of traces. Traces are records of all actions, messages, tool calls, reasoning, and intermediate communications, tracked across the lifecycle of a request. Traces are invaluable for debugging and provide insight into the actions an LLM takes in generating a response. 
The popular open source tools that we have used are <a href="https://mlflow.org/">MLflow</a> and <a href="https://langfuse.com/">Langfuse</a>.</p><p>With a reproducible foundation in place, the next critical component we need is an evaluation framework. The idea here is to perform error analysis: collect 50&#8211;100 examples of where the application is failing, ideally from real user conversations. If you don&#8217;t have any data, you can also generate synthetic data to get started.</p><p>These examples serve as an evaluation dataset. The evaluation process for agentic patterns can be broken down into two steps. First, we check whether the overall task was successful. Second, we perform a step-level diagnosis: checking whether tools were selected appropriately and whether the agent recovers from failures. For workflow-based patterns, error analysis needs to be targeted at each stage of the workflow. For more advanced cases, LLM-as-judge evaluators can also be included. There are general-purpose LLM evaluation tools such as <a href="https://deepeval.com/">DeepEval</a>, <a href="https://www.deepchecks.com/">Deepchecks</a> and <a href="https://docs.ragas.io/en/stable/">Ragas</a>.</p><p>To summarise, by this point we have an orchestrator, an application-specific evaluator, and an LLM-specific experiment tracker with tracing for debugging. Together, these enable us to confidently iterate and improve the performance of the agentic application. Because LLMs are susceptible to hallucinations and prompt injection, one common outcome of evaluation is the addition of guardrails around inputs and outputs to catch and flag issues early.</p><h3><strong>Production considerations</strong></h3><p>In this article, we&#8217;ve discussed some of the fundamentals of MLOps as they apply to tracing, reproducibility, and evaluation for agentic systems.</p><p>There are plenty of other considerations for productionising agentic applications. 
Defining clear success metrics at the start of a project is important if we want to meaningfully evaluate performance - the frameworks don&#8217;t do the thinking for us here. As a project evolves, we also need to consider increased complexity in observability and telemetry, alongside more sophisticated guardrails and safety controls. On top of that, standard application monitoring still applies.</p><h3><strong>What&#8217;s next? (all the world&#8217;s a stage)</strong></h3><p>Over the next few editions, we&#8217;re going to dive into some of the most important topics in agentic AI and AgentOps. We&#8217;ll cover multi-agent systems and agent-to-agent protocols, explore evaluation and testing in greater depth, look at fully self-hosted agentic applications, and cover safety and security &#8212; which may turn out to be the most important emerging topic in this field.</p><p>Agents are still a very new technology, and we&#8217;re constantly learning and refining our approach to AgentOps. We&#8217;re keen to hear your own experiences and lessons learned, so please get in touch and let us know.</p><p><em>Shubham (the perfect dude) is a master of AI with a passion for machine learning engineering and MLOps. He holds a Master&#8217;s degree in AI, enjoys running, and believes the best solutions are usually the simplest ones.</em></p><div><hr></div><h2><strong>And finally</strong></h2><h3><strong>What&#8217;s coming up</strong></h3><p><a href="https://mlopswtf-event-7.eventbrite.co.uk">Our next MLOps.WTF meetup</a> is happening on 22nd January 2026, hosted by Arm - and it&#8217;s now sold out!</p><p>If you&#8217;ve got a ticket but can no longer make it, please cancel so someone on the waiting list can take your place. 
And if you missed out, it&#8217;s still worth joining the waiting list&#8230; just in case.</p><p>This one&#8217;s an edge AI special, focused on what actually changes when models move out of the cloud and into the real world: tighter constraints, harder debugging, and failure modes you don&#8217;t see coming until you ship. We&#8217;ll be hearing practical stories from Arm, Fotenix, and Fuzzy Labs on what it takes to run edge AI systems day to day.</p><p><strong>&#128467;&#65039; Thursday 22nd January 2026 &#8212; Manchester</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-event-7.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Join the waiting list&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-event-7.eventbrite.co.uk"><span>Join the waiting list</span></a></p><p>We&#8217;re also headed to our first BIG event, <strong><a href="https://www.ai-expo.net/global/agenda/day-free-gold-ai-developer-conference-from-prototype-to-production/">AI &amp; Big Data Global</a></strong><a href="https://www.ai-expo.net/global/agenda/day-free-gold-ai-developer-conference-from-prototype-to-production/"> on </a><strong><a href="https://www.ai-expo.net/global/agenda/day-free-gold-ai-developer-conference-from-prototype-to-production/">4&#8211;5 February</a></strong>. 
Matt will be joining a panel at the conference, digging into what it really takes to take AI systems from prototype to production.</p><p><strong>&#128467;&#65039; 4&#8211;5 Feb 2026 &#8212; Olympia London</strong></p><p>If you&#8217;ll be there, come say hello, and if you show this newsletter, we&#8217;ll even give you a bottle of sauce. Secret password: IReadTheNewsletterUntilTheEnd.</p><div><hr></div><h3><strong>About Fuzzy Labs</strong></h3><p>We&#8217;re Fuzzy Labs, a Manchester-based MLOps consultancy founded in 2019. We&#8217;re engineers at heart, and nerds that are passionate about the power of open source.</p><p>And right now, we <strong>really are hiring</strong>. We&#8217;re growing fast &#8212; and we&#8217;re on the lookout for people to join our team.</p><p><strong>Open roles:</strong></p><ul><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><strong><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></strong></p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers#job-vacancies&quot;,&quot;text&quot;:&quot;See all current vacancies&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/careers#job-vacancies"><span>See all current vacancies</span></a></p><p>If you, or someone you know, want to build serious systems with people who can happily spend 30 minutes arguing about observability <em>and</em> espresso extraction, we&#8217;d love to hear from you.</p><p><em>Not subscribed yet? You probably should be. 
The next issue will be our <strong>MLOps.WTF meetup playback</strong> and then, after that we&#8217;ll be diving deeper into <strong>agents in production</strong>, starting with multi-agent systems. Or follow us on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn</a> to see what we&#8217;re up to&#129782;.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[A merry MLOps.WTF wrap up]]></title><description><![CDATA[Watch now | MLOps.WTF Edition #22]]></description><link>https://www.mlops.wtf/p/a-merry-mlopswtf-wrap-up</link><guid isPermaLink="false">https://www.mlops.wtf/p/a-merry-mlopswtf-wrap-up</guid><dc:creator><![CDATA[Matt Squire]]></dc:creator><pubDate>Tue, 23 Dec 2025 10:55:59 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/182405983/c5a60dcd2a8987c28c7e3ccb3e720316.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h1><strong>A message from Matt Squire</strong></h1><p>Did you know 2025 is a square year?</p><p>Take 45, square it, and you get 2025. If you&#8217;re wondering how special that is, the next square year is 46&#178; &#8211; which isn&#8217;t until 2116. A bit of a wait.</p><p>This year on MLOps.WTF we&#8217;ve shared 13 articles. Topics ranged from <a href="https://www.mlops.wtf/p/lets-build-a-sovereign-llm">sovereign large language models</a> &#8211; how to build your own, and why it matters &#8211; to AI safety, including the slightly terrifying idea of <a href="https://www.mlops.wtf/p/a-deep-dive-into-deepseek">using neural networks in safety-critical applications</a>. 
We talked about <a href="https://www.mlops.wtf/p/my-type-on-paper-the-future-of-software">vibe coding</a> and <a href="https://www.mlops.wtf/p/matt-squire-are-we-the-last-programmers">whether any of us will have a job in 2026.</a> And of course <a href="https://www.mlops.wtf/p/mlopswtf-5-newsletter-14">agents</a>, which have been a big focus for us.</p><p>We care about agents because we want to understand how to productionise them &#8211; and how to keep them running in production. We&#8217;ll have a lot more to say on this in 2026, including workflow management and multi-agent systems, evaluation approaches, self-hosting agentic models, and the safety and security work that comes with all of the above.</p><p>Thanks to all of our authors besides me &#8211; we&#8217;ve had articles from Danny, James, Sam, Sav, Rhiannon, and Tom.</p><p>We&#8217;ve also run five live MLOps.WTF events here in Manchester this year. You can find all the videos on <a href="https://www.youtube.com/channel/UCJwQVWdWOfK2XNAdD7tAT1g">our YouTube channel.</a></p><p>If you&#8217;d like to come to the next one, it&#8217;s on 22 January. It&#8217;s being held at Arm&#8217;s new office in Manchester, and we&#8217;re focusing on edge AI &#8211; basically any situation where we want to train, optimise, deploy, and manage models on specialised hardware.</p><p>That includes cases with power constraints, where we want models to use as little energy as possible. Or memory constraints, where we need models to be small enough to fit. Or situations where we want to do most inference locally, close to the data, and only send summaries up to the cloud &#8211; because latency, bandwidth, cost, or privacy says we should.</p><p>We&#8217;ll cover all of those topics on 22 January with our edge AI MLOps special. 
If you&#8217;d like to come along, here&#8217;s the sign up link.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-event-7.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Tickets for 22nd Jan&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-event-7.eventbrite.co.uk"><span>Tickets for 22nd Jan</span></a></p><p>I couldn&#8217;t find anything mathematically interesting about 2026. If you&#8217;ve got one, send it in a message or leave a comment &#8211; we&#8217;d genuinely love to hear it.</p><p>In the meantime, have a great holiday, and we&#8217;ll see you in 2026!</p>]]></content:encoded></item><item><title><![CDATA[AI Agents in Production: starting with the fundamentals]]></title><description><![CDATA[MLOps.WTF Edition #21]]></description><link>https://www.mlops.wtf/p/ai-agents-in-production-all-this</link><guid isPermaLink="false">https://www.mlops.wtf/p/ai-agents-in-production-all-this</guid><dc:creator><![CDATA[Matt Squire]]></dc:creator><pubDate>Thu, 04 Dec 2025 12:04:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!afE_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,<br><br><em>Matt is back! Episode #21 is brought to you by Matt Squire, CTO, Co-Founder, Fuzzy Labs.</em></p><p><strong>How do we deploy software that thinks for itself?</strong></p><p>It&#8217;s a common theme in this newsletter that things change quickly in the world of MLOps. According to Google Trends, the term itself only gained popularity in 2019. 
Back then, the hard thing we were all grappling with could be summarised like this: how do we deploy and maintain software that&#8217;s fundamentally non-deterministic?</p><p>This description applies to all the traditional ML things that we know and love, like recommender models, sentiment scoring, image segmentation, etc. And it applies in the same way to the first wave of generative AI applications, such as RAG (<a href="https://www.mlops.wtf/p/you-could-have-invented-rag">see issue #4</a>). Non-determinism in ML comes from a few places: randomness during training, gradual data drift during inference, and (especially with LLMs) stochastic sampling during generation.</p><p>That stuff is hard enough to deal with, but agents are much worse, because they add an entirely new dimension to the challenge: agents can reason and follow complex workflows. They can act, and interact with the world in ways that compound unpredictability. A traditional ML model makes a prediction <em>on request</em>, and then it sits there waiting for the next request. But an agent makes a prediction, takes an action, observes the result, and can keep going, potentially dozens of times in a single run.</p><p>As MLOps practitioners, how do we approach this challenge? In this article I&#8217;ll introduce some of the emerging ideas and tools for running AI agents in production &#8212; AgentOps, if you like.</p><p><strong>Agentic workflows: or fancy </strong><em><strong>while</strong></em><strong> loops</strong></p><p>To begin with, I&#8217;d like to demystify this word &#8216;agent&#8217;. The term has been around in AI research since the 1980s, but it was researchers like Pattie Maes at MIT&#8217;s Media Lab who brought it into the mainstream. 
When Maes launched her Software Agents Group in 1991, she defined an agent as a program that could act autonomously on behalf of a user or another program.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!afE_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!afE_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 424w, https://substackcdn.com/image/fetch/$s_!afE_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 848w, https://substackcdn.com/image/fetch/$s_!afE_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 1272w, https://substackcdn.com/image/fetch/$s_!afE_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!afE_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp" width="1300" height="1300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1300,&quot;width&quot;:1300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/180691143?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!afE_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 424w, https://substackcdn.com/image/fetch/$s_!afE_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 848w, https://substackcdn.com/image/fetch/$s_!afE_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 1272w, https://substackcdn.com/image/fetch/$s_!afE_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b9b377b-4fa8-477d-8341-a21707aa0fc8_1300x1300.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><em>(Photo credit: Susan Lapides, 2013 - Pattie Maes)</em></p><p>Nowadays, &#8216;agent&#8217; refers to a specific way of using large language models with tools&#8212;and the mechanism is remarkably simple.</p><p>Suppose we want an AI to assist with booking meetings. You could prompt an LLM with everyone&#8217;s calendar slots and ask it to reply with a suitable time. That works, but what if we need more information from the user, or additional data from the calendar system?</p><p>Instead, give the LLM tools and let it make its own choices:</p><p><em>&#8220;You are a calendar booking assistant. The user wants to book a meeting for Alice and Bob this week. 
You may:</em></p><p><em>a) ask to see a user&#8217;s calendar: &lt;tool:calendar,user name&gt;;</em></p><p><em>b) ask the user for additional clarification: &lt;ask:question&gt;;</em></p><p><em>c) propose a meeting time along with &lt;done&gt;&#8221;.</em></p><p>To make this work, we need a program &#8212; let&#8217;s call it a workflow orchestrator &#8212; that interprets the LLM&#8217;s responses and acts on them. After each action, the orchestrator calls the LLM again with the results: <em>&#8220;On the last turn you asked to see Alice&#8217;s calendar. Here are her available slots: [...]&#8221;</em>. The LLM decides what to do next&#8212;maybe it needs Bob&#8217;s calendar too, or maybe it can propose a time.</p><p>This continues in a loop until the LLM returns &lt;done&gt;.</p><p>That&#8217;s the core idea: a while loop where the LLM decides what happens next. This basic structure is what powers our coding assistants, research tools, and more. By giving the LLM the power to pursue a goal autonomously and make decisions based on what it observes at each step, we end up with an <em>agent</em>.</p><p>By the way, if you&#8217;re familiar with the concept of <em>continuation passing</em> in programming, then you&#8217;ll notice some similarities here!</p><p><strong>From loops to workflows</strong></p><p>The calendar booking example above illustrates the concept, but in practice it&#8217;s very limited. What happens when the LLM makes a mistake and needs to backtrack? What if you want multiple agents working in parallel, perhaps one checking calendars while another drafts a meeting agenda? 
And what if a human needs to be <em>in the loop</em>, say by approving the proposed time before committing?</p><p>What we really need is a <em>workflow framework</em>. Tools like <a href="https://github.com/langchain-ai/langgraph">LangGraph</a> (from the makers of LangChain) and <a href="https://github.com/crewAIInc/crewAI">CrewAI</a> take the basic while-loop pattern and add the structure you need for production: state management, branching logic, error recovery, and orchestration of multiple agents or steps.</p><p>LangGraph, for instance, lets you define your agent as a directed graph where nodes represent actions (e.g. call the LLM, invoke a tool, wait for human input) and edges represent transitions between them. The framework can persist state, so if your agent fails during a complex process, you can resume from where it left off instead of starting again.</p><p><strong>Using tools</strong></p><p>Workflows are what enable an agent to reason sequentially, i.e. to work through a task in multiple steps. But our agents also need to <em>observe</em> and <em>act</em>, and to do that, they need access to tools. Tools give agents access to things like databases, file storage, and APIs. In the calendar booking example, we glossed over exactly how tool calling works, so let&#8217;s take a closer look at that now.</p><p>For an LLM to make use of tools, we need to agree on two things. Firstly, how do we describe a tool to the model? Secondly, when the model wishes to invoke a tool, how should it communicate its intentions back to us?</p><p>In other words, we need a protocol, and Anthropic&#8217;s MCP (Model Context Protocol) has become the de facto standard way to describe and interface with tools. Each tool has an MCP <em>server</em> which knows how to talk to that tool. 
Workflow frameworks use an MCP <em>client</em> to talk to these servers.</p><p>The standardisation that MCP brings is important, particularly because it means we can swap out tools without rewriting the agent, and different workflow frameworks can now share the same tool integrations.</p><p><strong>Deploying agents</strong></p><p>At first glance, deployment looks straightforward. Components related to workflow orchestration, as well as your MCP servers, need to be deployed, scaled, and monitored. We need infrastructure, CI/CD pipelines, central logging&#8230; so far, so good.</p><p>In traditional ML deployments we usually assume a single inference step. So, you send data to a model, get a prediction back, and you&#8217;re done. But as we&#8217;ve seen, that&#8217;s not how agents work. A single request from a user might trigger ten individual LLM calls, along with three API requests and a database operation.</p><p>That means your deployment needs to handle long-running processes, manage state between steps, and deal with failure gracefully. What happens if our agent is halfway through booking a meeting and the calendar API times out? Should it retry? How many times? In the end, do we fail the whole workflow, or save and resume later on?</p><p>We can make life even harder by introducing <em>multiple</em> agents that need to coordinate in order to accomplish more complex goals. How do these agents share state and agree on task orderings?</p><p>The good news is that these aren&#8217;t new problems in software engineering. Ultimately, we&#8217;re talking about the challenges of <em>distributed systems</em>. Statefulness is the enemy, so we need to avoid it as much as possible. 
MCP servers, for example, should most definitely be stateless. Workflows, however, are stateful by definition, so frameworks like LangGraph include helpful features like state persistence and recovery.</p><p>For the multi-agent case, there are emerging standards designed to help with the coordination problem &#8212; in particular Google&#8217;s <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">Agent-to-Agent protocol</a>.</p><p><strong>Observing and monitoring agents</strong></p><p>Once our agents are running in production, we need to understand what they&#8217;re actually doing.</p><p>Traditional ML monitoring is concerned with things like model drift, the distribution of features, and the accuracy of predictions. These are still of some interest &#8212; for example, we might want to track drift in the content of a typical user query &#8212; but the focus shifts more to tracing the agent&#8217;s reasoning chain. What tools did it call? What did they return and how did the LLM interpret the tool response? What decisions were made?</p><p>This is harder than it sounds, because a single agent run might involve many LLM calls, each with its own context, system prompt, and settings. Traditional logging isn&#8217;t quite enough, because we need to connect together every step within the run.</p><p>Tools like <a href="https://github.com/langfuse/langfuse">Langfuse</a> are designed to help with this. Langfuse keeps track of LLM calls, tool invocations, embeddings, and retrievals. It also provides the means to manage and version control prompts. Another tool is LangSmith, built by LangChain, although it&#8217;s worth noting that this one is not open source.</p><p><strong>Evaluations for agents</strong></p><p>While observability tells us what our agents are doing while they&#8217;re in production, evaluation is how we determine whether an agentic workflow is <em>correct</em>, as well as <em>safe</em> and <em>secure</em>. 
Ideally, we want to run evaluations prior to any deployment. Think of it as a full end-to-end system test.</p><p>There&#8217;s an emerging discipline around evaluating the outputs from an LLM, which we recently wrote about in <a href="https://www.mlops.wtf/p/with-great-predictive-power-comes">edition 19</a>. As well as standard or &#8220;happy path&#8221; inputs, we want to test edge cases and adversarial inputs (e.g. trying to break the guardrails or safety features). Because LLM output is stochastic, to evaluate the outputs we often need to use semantic similarity scoring, or even use <em>another</em> LLM to judge an output.</p><p>As we&#8217;ve seen, with agents, we aren&#8217;t just dealing with single LLM invocations. We also need a way to evaluate a whole workflow. Take our calendar booking example. Success isn&#8217;t just &#8220;did it book a meeting?&#8221; You also need to know: did it check the right calendars? Did it ask clarifying questions when needed? Did it handle conflicts gracefully? Did it book a meeting at a time that actually makes sense?</p><p>A tool like <a href="http://evidently.ai/">Evidently AI</a> provides the functionality for evaluating individual LLM calls, but it also supports evaluations at the workflow level, for example tracking workflow progress and failed steps.</p><p>A key thing to remember is that evaluation doesn&#8217;t just happen once. It&#8217;s something that should be done every time you want to deploy a change. In agentic applications, a small change can have far-reaching and hard-to-predict implications. Additionally, many of the techniques used &#8212; like LLM as a judge &#8212; can also be used within live monitoring in order to flag up problems in production.</p><p><strong>Where next?</strong></p><p>This has been an overview of what the MLOps landscape looks like for agentic AI. But this is a big topic, and we&#8217;ll be following up with some deeper dives into agents in the next few editions. 
</p><p>To round up, one final observation is just how much the challenges of agentic AI engineering resemble those of distributed systems. I think we can expect to see more and more influence from the world of distributed systems showing up in the future.</p><div><hr></div><h2><strong>And finally</strong></h2><p>What&#8217;s Coming Up<br>Our next MLOps.WTF event is living on the edge or, should we say, all about edge AI. Details are yet to be fully released, but tickets will sell out. If you want to join us, get in early!<br><br>&#128467;&#65039; <strong>Meetup #7. 22nd January 2026 x Arm</strong>&#128071;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.co.uk/e/mlopswtf-by-fuzzy-labs-meetup-7-22nd-january-2026-x-arm-tickets-1968700520252?utm-campaign=social&amp;utm-content=attendeeshare&amp;utm-medium=discovery&amp;utm-term=listing&amp;utm-source=cp&amp;aff=ebdsshcopyurl&quot;,&quot;text&quot;:&quot;Get my MLOps.WTF ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.eventbrite.co.uk/e/mlopswtf-by-fuzzy-labs-meetup-7-22nd-january-2026-x-arm-tickets-1968700520252?utm-campaign=social&amp;utm-content=attendeeshare&amp;utm-medium=discovery&amp;utm-term=listing&amp;utm-source=cp&amp;aff=ebdsshcopyurl"><span>Get my MLOps.WTF ticket</span></a></p><h2>About Fuzzy Labs</h2><p><em>We&#8217;re Fuzzy Labs, a Manchester-based open-source MLOps consultancy founded in 2019.</em></p><p><em>We help organisations build and productionise AI systems they genuinely own, maximising flexibility, security, and licence-free control. 
</em></p><p>We&#8217;re growing fast, and hiring for the following roles:</p><ul><li><p><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><a href="https://fuzzy-labs.webflow.io/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></p></li></ul><p>If you, or someone you love, enjoys building reliable ML systems and doesn&#8217;t mind the <s>odd</s> frequent debate about coffee brewing methods, have a look at our careers page.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://fuzzy-labs.webflow.io/careers#job-vacancies&quot;,&quot;text&quot;:&quot;Open Roles&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://fuzzy-labs.webflow.io/careers#job-vacancies"><span>Open Roles</span></a></p><p>If this edition was useful, pass it on. You can also find us on LinkedIn, where we post updates, videos, and the occasional explanation.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/p/ai-agents-in-production-all-this?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/p/ai-agents-in-production-all-this?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>Not subscribed yet? Strange. 
The next edition will be our agents deep dive, covering workflows and coordination. Make sure you&#8217;re signed up to get it in your mailbox.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Videos from MLOps.WTF #6]]></title><description><![CDATA[3 Great Talks Covering AI Evaluations]]></description><link>https://www.mlops.wtf/p/videos-from-mlopswtf-6</link><guid isPermaLink="false">https://www.mlops.wtf/p/videos-from-mlopswtf-6</guid><dc:creator><![CDATA[Tom Stockton]]></dc:creator><pubDate>Tue, 25 Nov 2025 11:38:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/24b93fdb-411e-45fe-860a-c19c5bf65187_4032x3024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>All the slides from the talks are here:</p><div class="file-embed-wrapper" data-component-name="FileToDOM"><div class="file-embed-container-reader"><div class="file-embed-container-top"><image class="file-embed-thumbnail" src="https://substackcdn.com/image/fetch/$s_!tQYS!,w_400,h_600,c_fill,f_auto,q_auto:best,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a702b2a-b241-4ba4-a864-994f8833efbb_955x539.png"></image><div class="file-embed-details"><div class="file-embed-details-h1">MLOps.WTF #6 Slides</div><div class="file-embed-details-h2">8.85MB &#8729; PDF file</div></div><a class="file-embed-button wide" href="https://www.mlops.wtf/api/v1/file/a1f8067f-9029-4b10-99ea-f1132dda5867.pdf"><span class="file-embed-button-text">Download</span></a></div><a class="file-embed-button narrow" href="https://www.mlops.wtf/api/v1/file/a1f8067f-9029-4b10-99ea-f1132dda5867.pdf"><span 
class="file-embed-button-text">Download</span></a></div></div><p>Matt&#8217;s intro.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;6cf6a340-b033-420b-a374-0a8c74fa51d2&quot;,&quot;duration&quot;:null}"></div><p>First up - Daisy &#8230;</p><div id="youtube2-clG3P1yoSw0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;clG3P1yoSw0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/clG3P1yoSw0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Brad up next &#8230;</p><div id="youtube2-PlddJSisZFM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;PlddJSisZFM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/PlddJSisZFM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Liam wrapping things up.</p><div id="youtube2-oYZhyL7uA0M" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;oYZhyL7uA0M&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/oYZhyL7uA0M?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Like this? Subscribe to get notified of more &#8230;</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/p/videos-from-mlopswtf-6/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/p/videos-from-mlopswtf-6/comments"><span>Leave a comment</span></a></p>]]></content:encoded></item><item><title><![CDATA[Monitoring, evaluating, and why you really gotta catch ‘em all!]]></title><description><![CDATA[MLOps.WTF Edition #20]]></description><link>https://www.mlops.wtf/p/monitoring-evaluating-and-why-you</link><guid isPermaLink="false">https://www.mlops.wtf/p/monitoring-evaluating-and-why-you</guid><dc:creator><![CDATA[Rhiannon]]></dc:creator><pubDate>Thu, 20 Nov 2025 11:46:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7f7eaafd-9c96-4c12-a6c7-aa12a6298f97_4032x3024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Our MLOps.WTF meetup #6, where Pok&#233;mon references outnumbered technical diagrams, and the clicker staged a full rebellion pre-kick off.</em></p><p>It might have been cold November rain outside, but it was another record turnout for<a href="https://www.mlops.wtf/"> MLOps.WTF</a> #6, our first time taking 
the meetup on the road to Matillion&#8217;s brilliantly retro office, complete with a green terrazzo reception desk. The theme of the meetup, however, was anything but retro: how do you monitor and evaluate AI? A kick-off show of hands revealed that about half the room knew what this meant.</p><p>When we talk about evaluations in MLOps and AI, we&#8217;re talking about the tools and techniques that give us confidence our machine learning <strong>system</strong> is working. System, not model, because the model is only a small part of it. When you&#8217;re building an AI-powered product, you need to evaluate the whole thing.</p><p>There&#8217;s also a distinction between evaluation and monitoring: monitoring is the ongoing thing you do in production because data changes and there are things you didn&#8217;t anticipate, while evaluation is what you run before deployment to understand whether the thing works correctly in the first place.</p><p>Settle in for three talks on fraud detection, interview intelligence, and digital data engineers - one with Pok&#233;mon scattered throughout, one apologising for the lack of Pok&#233;mon, and one warning us about &#8220;the world&#8217;s worst diagram.&#8221;</p><div><hr></div><div class="image-gallery-embed" 
data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/abba3f48-de1a-4723-900c-c3ce4e3bbbcc_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0df1d156-7e9d-43da-b0e5-742ec0d8aee5_6850x3788.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/557f5574-243c-4c16-9d7f-7ce993fc7c74_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8f09920-b15a-4792-b3a1-6f664585d9fc_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4477b678-dbbd-4ec7-bf4b-a78fb6188422_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2fa353e-7319-4c01-8274-8482ac61ba8a_3024x4032.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/094b9113-7c20-4726-9681-8a93a717ef84_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4714a4b0-59a5-4bb7-910b-6d5dd07d6f00_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/261a3606-debe-4fe9-9cd7-58da1b82cdd1_4032x3024.jpeg&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db06b3c0-0af7-42d0-bf58-4b6286b97c6e_1456x1454.png&quot;}},&quot;isEditorNode&quot;:true}"></div><div>
<hr></div><h2><strong>Daisy Doyle: &#8220;Fraud &#8211; Gotta Catch &#8216;Em All&#8221;</strong></h2><p>Daisy, data scientist at Awaze, committed to the Pok&#233;mon theme. Trainers and Pok&#233;balls throughout, plus two fictional companies &#8220;Eevee Trading Cards&#8221; and &#8220;Snorlax Spa Breaks&#8221; for her case studies that she assures us bear no resemblance to anywhere real.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OVf1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OVf1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 424w, https://substackcdn.com/image/fetch/$s_!OVf1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 848w, https://substackcdn.com/image/fetch/$s_!OVf1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 1272w, https://substackcdn.com/image/fetch/$s_!OVf1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OVf1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png" width="1456" 
height="822" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2137101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/179443294?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OVf1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 424w, https://substackcdn.com/image/fetch/$s_!OVf1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 848w, https://substackcdn.com/image/fetch/$s_!OVf1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 1272w, https://substackcdn.com/image/fetch/$s_!OVf1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd313f19-8453-45ac-ad81-87fcae56df06_2296x1296.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>The scale of the problem</strong></h3><p>Last year in the UK, e-commerce fraud ran to over 3 million events, with over 1 billion pounds stolen and just under 1.5 billion prevented. That is a prevention rate of around 60% by value, which is something, but not great.</p><p>The fraud comes in different flavours - promo abuse, chargebacks, account hijacking, triangulation - and which ones you&#8217;re dealing with shapes how you evaluate.</p><h3><strong>AWS Fraud Detector</strong></h3><p>Daisy uses AWS Fraud Detector, a fully managed service that takes 18 months of historical data with fraud/legitimate labels and builds a model. 
It&#8217;s a black box, so you can&#8217;t see what&#8217;s under the hood, but each order gets a fraud likelihood score between 0 and 1000.</p><p>She applied this to two very different scenarios&#8230;</p><p>Eevee Trading Cards: high volume, limited edition cards, with fraud trends changing every four to six weeks around new releases, mostly account takeover.</p><p>Snorlax Spa Breaks: slower moving, with fraud clustering around big Pok&#233;mon calendar events, mostly chargebacks and triangulation.</p><h3><strong>Why accuracy doesn&#8217;t work here</strong></h3><p>&#8220;If my model said all orders are legitimate, my accuracy would be 99.5%. Because fraud is typically under half a percent of all orders.&#8221;</p><p>Accuracy is useless when your data is that imbalanced. What actually matters is the true positive rate and the false positive rate, because you want to catch as many fraud events as possible whilst not stopping real customers, and some real orders do look a bit fishy.</p><p>Before deployment, test in stages: start with a sample of 100 fraud events to check the model catches them, then a 50/50 split of legitimate to fraudulent orders, then a more realistic ratio of 2% fraud to 98% legitimate, to see whether it can still pick out the signal when fraud is sparser.</p><p>AWS Fraud Detector also provides variable enrichment through its own databases of fraudulent emails, addresses, and phone numbers. Geolocation enrichment worked particularly well for Eevee: cross-referencing billing address, IP address, and shipping address flags orders placed in a different country to where they&#8217;re being shipped.</p><h3><strong>Training the trainers</strong></h3><p>Human reviewers are required for GDPR compliance, but they bring their own bias. Reviewers can anchor on an AI score even when the model isn&#8217;t that strong yet, so their confidence in the system has to be built on evidence.</p><p>This means running a trial period where you&#8217;re validating both the model and the reviewers. 
For fast-turnover goods like Eevee, you can let some orders through and wait to see what actually turns out to be fraud - you get ground truth to check the model against, and reviewers get to see how their judgment compares to outcomes. For high-cost items like Snorlax where chargebacks take up to a year, you invest more in the review process itself: more time, more information, let them speak to customers directly. Your business context determines how you build that confidence.</p><h3><strong>Retraining and thresholds</strong></h3><p>Retraining frequency depends on how fast your fraud moves. Fast-moving trends like Eevee need monthly or faster, while slower seasonal trends like Snorlax can be quarterly or around big calendar events. There&#8217;s also a limitation with Fraud Detector that it can only process one batch of predictions at a time, so you need to think about batching.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XZuA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XZuA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 424w, https://substackcdn.com/image/fetch/$s_!XZuA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 848w, 
https://substackcdn.com/image/fetch/$s_!XZuA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!XZuA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XZuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png" width="1456" height="821" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1302682,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/179443294?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XZuA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 424w, 
https://substackcdn.com/image/fetch/$s_!XZuA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 848w, https://substackcdn.com/image/fetch/$s_!XZuA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!XZuA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb743aadc-98cd-467d-8078-7ed8ee2557a8_2294x1294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And how much fraud should you let happen? Sounds counterintuitive, but your model needs examples that are definitely fraud to retrain on. When you stop an event in progress, you never know for certain whether it actually was fraud. So there&#8217;s an argument for letting through a small percentage based on your risk appetite, just to maintain training data quality.</p><p>The takeaway from Daisy&#8217;s presentation: monitoring and evaluation are shaped by business context and data, not just technical constraints. And there is no such thing as too many Pok&#233;mon when it comes to MLOps.WTF presentations.</p><div><hr></div><h2><strong>Bradney Smith: &#8220;Six Lessons in Evaluating Gen AI&#8221;</strong></h2><p>Bradney, AI Lead at Spotted Zebra, apologised for the lack of Pok&#233;mon, but he did give us six really great lessons on evaluating gen AI, so we&#8217;ll let it slide.</p><p>To set the scene, Bradney started with a story. He joined Spotted Zebra in August 2024. His first task: to build out the AI team and infrastructure, with his first project being &#8220;Skills Evaluation&#8221; (extracting evidence of soft and technical skills from interview transcripts).</p><p>The only way to know if the AI was working was to have the occupational psychologists review the outputs. Make changes to prompts, send to the experts, wait for feedback, iterate.</p><p>Then a prospective client got interested. Really interested. Feedback went from informal and infrequent to formal and regular, sometimes multiple times a day. A year&#8217;s worth of development in a few months.</p><p>The manual review loop couldn&#8217;t keep up.</p><p>That&#8217;s when they built evaluation infrastructure.</p><h3><strong>Lesson 1: Golden examples</strong></h3><p>A golden example is a document where you define exactly what the correct output should be for a given input. 
For Skills Evaluation, that means: interview transcript goes in, correctly extracted evidence comes out. You create a set of these with your domain experts, and that becomes your gold standard to test against.</p><p>The shift is significant. Instead of sending every change to experts for review, you test against the examples they&#8217;ve already created. Now you can measure properly. Prompt engineering becomes quantifiable experiments instead of gut feel.</p><p>They stratified their golden examples by role level and industry. That granularity meant they could see exactly where prompts were failing - junior roles missing university experience because prompts only looked for workplace evidence, for instance.</p><p>When GPT-5 came out, they tested it on day one. Every new model, even in the same family, has quirks. Golden examples told them exactly what those quirks were so they could prompt around them instead of assuming newer means better, although in this case - GPT-5 is pretty good.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uiXa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uiXa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 424w, https://substackcdn.com/image/fetch/$s_!uiXa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 848w, 
https://substackcdn.com/image/fetch/$s_!uiXa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!uiXa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uiXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:339602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/179443294?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uiXa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 424w, 
https://substackcdn.com/image/fetch/$s_!uiXa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 848w, https://substackcdn.com/image/fetch/$s_!uiXa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!uiXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89171ff2-a472-4a7e-aaa5-23ebfae95650_2318x1298.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Lesson 2: Version your prompts</strong></h3><p>&#8220;Please don&#8217;t hard code your prompts. It makes things so much more difficult.&#8221;</p><p>Treat prompts like code. Use a structured file system with semantic versioning. They built their own format because they couldn&#8217;t find one they liked: a YAML file with provider, parameters, system prompt, and a templated user prompt. Everything needed to run the experiment again.</p><p>Keep a changelog so you can track how prompts improve over time. Keep a config file on prod that records which prompt version is live, so you can roll back without restarting the server.</p><h3><strong>Lesson 3: Model gateway</strong></h3><p>They kept finding engineers across the business writing the same API call logic over and over. Same boilerplate, different codebases, nobody maintaining it consistently. So they built a model gateway - one function in a commons library that everyone uses. It parses the prompt files, packages the API calls, captures latency and cost metrics, and handles retry logic.</p><p>You build it once, and suddenly that&#8217;s one less thing for everyone to think about.</p><h3><strong>Lesson 4: LLM as a judge</strong></h3><p>Now, golden examples work when there&#8217;s a correct answer - information extraction, classification, etc. But Spotted Zebra also has a feature that generates interview questions. Suddenly there&#8217;s no single correct question - you could write thousands that are all slightly different, but all equally good.</p><p>For tasks like that, you can define golden criteria instead: ask what makes a good interview question, rather than pinning down the wording of the questions themselves. The LLM judge is then given your criteria and scores outputs against them. 
You can then use a different model family to avoid self-bias, and validate your judge against human expert assessments before you trust it.</p><p>Golden examples are for development time - you test before deployment. LLM as a judge can run at inference time, giving you continuous quality assessment in production.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Pxe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Pxe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 424w, https://substackcdn.com/image/fetch/$s_!7Pxe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 848w, https://substackcdn.com/image/fetch/$s_!7Pxe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 1272w, https://substackcdn.com/image/fetch/$s_!7Pxe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Pxe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png" width="1456" height="573" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:573,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/179443294?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Pxe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 424w, https://substackcdn.com/image/fetch/$s_!7Pxe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 848w, https://substackcdn.com/image/fetch/$s_!7Pxe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 1272w, https://substackcdn.com/image/fetch/$s_!7Pxe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f7f693b-2a41-4feb-bf40-70360e5d63c5_1992x784.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Learn more about <a href="https://www.youtube.com/watch?v=1wA2YdRifJ4">&#8216;LLM as a judge&#8217; from our last MLOps.WTF meetup</a> with Emeli from Evidently AI <a href="https://www.youtube.com/watch?v=1wA2YdRifJ4">here</a>.</em></p><h3><strong>Lesson 5: Turn failures into tests</strong></h3><p>Most teams dread production errors - Spotted Zebra used to as well, but they now maintain what they call an adversarial testing bank: a collection of the most difficult inputs they&#8217;ve encountered, e.g. edge cases, prompt injections, empty inputs, password-protected PDFs.</p><p>At scale, edge cases stop being edge cases - they become the norm. So every time they find an error in production, it goes straight into the bank.</p><p>&#8220;We used to absolutely dread seeing errors of course... but now they happen much less frequently. 
So when they do occur, we get a little bit excited.&#8221;</p><h3><strong>Lesson 6: Log everything properly</strong></h3><p>You don&#8217;t need logs until you do.</p><p>You can&#8217;t predict when that will be, and when you need them you need them fast, so make them searchable.</p><h3><strong>Earning your stripes (or spots)</strong></h3><p>So what did all this get them? Development is faster because they&#8217;re not waiting for the old feedback loop. Product quality is better. When clients ask how you know your black box system is working, you actually have an answer.</p><p>And as Bradney pointed out: the EU AI Act&#8217;s main obligations are due to start applying in 2026. The companies with evaluation infrastructure will be ready.</p><div><hr></div><h2><strong>Liam Stent: &#8220;From Vibes to Data&#8221;</strong></h2><p>Liam Stent from Matillion has been building Maia from day one. His talk: how do you go from &#8220;it feels like it&#8217;s working&#8221; to actually being able to prove it?</p><h3><strong>What is Maia?</strong></h3><p>Maia is Matillion&#8217;s digital data engineer - or rather, a team of digital data engineers. It builds pipelines, builds connectors, does root cause analysis, and writes documentation. The output is DPL, Data Pipeline Language - a YAML-based format that&#8217;s human-readable.</p><p>Early reactions were brilliant - customers could see how it would change how their data engineers work. But when you&#8217;re scaling to enterprise customers paying hundreds of thousands of pounds, you need to be confident in the data. You need to be able to prove it.</p><p>Matillion has a value: innovate and demand quality, with the tagline &#8220;no product, process or person is ever finished.&#8221; They used that to drive how they measured Maia - starting simple, getting more structured, then automating.</p><h3><strong>Starting simple</strong></h3><p>They fed their certification exam to AI prompt components. Some models failed, some passed. 
It taught them how to use RAG effectively and get LLMs familiar with DPL using their documentation - a starting point for what &#8220;working&#8221; actually looked like.</p><h3><strong>Getting structured</strong></h3><p>They then introduced an LLM judge to measure what Maia was producing. (seeing a theme here).</p><p>This was tricky because there are many valid ways to build a pipeline - you can put everything in a Python script and get the same result as a nicely structured low-code pipeline. So they had to teach the judge what good looks like using thousands of reference pipelines.</p><p>Before trusting the judge, they validated it manually. Domain experts would write the expected answer, provide counterpoints, assess how much confidence they had in the judge&#8217;s scoring. They found judges show bias within model families, so you need to use a different model to evaluate than the one doing the work.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e20f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e20f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 424w, https://substackcdn.com/image/fetch/$s_!e20f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 848w, 
https://substackcdn.com/image/fetch/$s_!e20f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!e20f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e20f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:630767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/179443294?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e20f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 424w, 
https://substackcdn.com/image/fetch/$s_!e20f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 848w, https://substackcdn.com/image/fetch/$s_!e20f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!e20f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bf82b2-a467-4a26-b0bc-119ed0cfc478_2312x1300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>Automating it</strong></h3><p>They built evaluation into their test framework - runs on builds and deploys with a bank of prompts and context variations. All output goes into LangFuse dashboards: scores, time taken, tokens used, costs. Engineers drill into individual traces when something goes wrong.</p><p>&#8220;They like it when things go wrong because that&#8217;s how we&#8217;re able to learn from it.&#8221;</p><p>When Sonnet 4.5 came out on Bedrock they upgraded with confidence in 24 hours. They could show stakeholders the testing, results before and after, why they were confident. Evidence instead of gut feel.</p><h3><strong>Integrating ML skills into engineering teams</strong></h3><p>How does a traditional Java software engineering shop integrate data scientists and MLOps engineers? Don&#8217;t treat it as anything special. Same backlog, same standups, same roadmap. If the work is important, do it.</p><p>They had to educate stakeholders on why they needed engineering cycles making things better without adding features - that&#8217;s the MLOps work that makes everything else possible.</p><p>But then it just became normalised.</p><p>Good engineers want to learn from other good engineers, T-shaped skills develop naturally, and the more generalists can handle without handing over, the more experts can do deep work.</p><div><hr></div><h2><strong>Rounding up MLOps.WTF #6</strong></h2><p><em>Zooming out, these are our top takeaways:</em></p><ul><li><p>Choose metrics that actually tell you if the system is working.</p></li><li><p>Front-load expert effort into examples and criteria, not reviews.</p></li><li><p>Treat prompts like code.</p></li><li><p>Build evaluation infrastructure once.</p></li><li><p>Business context shapes every decision.</p></li></ul><p><em>Thank you to all our speakers! 
You set a high bar for MLOps.WTF. Full talks will be available soon on the Fuzzy Labs YouTube channel. (Keep an eye out!)</em></p><div><hr></div><h2><strong>Final bits</strong></h2><p>Fancy your own pair of Fuzzy Labs socks? All speakers are awarded an aesthetically pleasing pair of Fuzzy Labs mathematical socks. Become our next speaker to get yours!</p><p>A final big thank you to Matillion for hosting. This was the first time we&#8217;ve taken MLOps.WTF on the road, and we loved being at your space-themed event space.</p><div><hr></div><h2><strong>What&#8217;s coming up</strong></h2><p><strong>Next MLOps.WTF event.<br>Jan 22nd. Tickets now available.</strong></p><p>Edge AI at Arm&#8217;s fancy new office. We&#8217;re still working on the full details, but it&#8217;s one to get in the diary for the new year.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlopswtf-event-7.eventbrite.co.uk&quot;,&quot;text&quot;:&quot;Get My Ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlopswtf-event-7.eventbrite.co.uk"><span>Get My Ticket</span></a></p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We&#8217;re Fuzzy Labs. Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p><em>Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won&#8217;t cut it.</em></p><p><strong>We&#8217;re hiring! </strong>Fancy becoming the next Fuzzican? 
Check out our <a href="https://www.fuzzylabs.ai/careers">careers page.</a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.fuzzylabs.ai/careers&quot;,&quot;text&quot;:&quot;Work For Us&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.fuzzylabs.ai/careers"><span>Work For Us</span></a></p><p>Liked this? Forward it to someone making deployment decisions based on vibes. Or follow us on <a href="https://www.linkedin.com/company/fuzzy-labs/">LinkedIn.</a></p><p>Not subscribed yet? What are you waiting for?</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[With great predictive power comes great responsibility]]></title><description><![CDATA[MLOps.WTF Edition #19]]></description><link>https://www.mlops.wtf/p/with-great-predictive-power-comes</link><guid isPermaLink="false">https://www.mlops.wtf/p/with-great-predictive-power-comes</guid><pubDate>Thu, 06 Nov 2025 10:02:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PF5Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,</p><p><em>This episode is brought to you by James Stringer, MLOps Tech Lead at Fuzzy Labs.<br><br></em>It is now 3 years since ChatGPT was released and the progress since then is staggering. The latest Large Language Models (LLMs) are incredibly powerful tools that are seeing increasingly deep integration into our software and our lives. 
In the last year we&#8217;ve seen the rise of &#8216;agents&#8217; &#8211; LLMs operating semi-autonomously to write code, search the web, and even interact with other software using the Model Context Protocol (MCP) standard. As the models released by the frontier AI labs become smarter and more capable, it seems inevitable that we will cede more and more of our workflows and responsibilities to them. The possibility of &#8216;intelligent software&#8217; that is aware of context and can choose the most appropriate course of action is truly tantalising.</p><p>But herein lie two challenges.</p><p>First, the language models that underpin these agentic systems are inherently <em>stochastic</em>, meaning that there is an intrinsic degree of randomness and open-endedness in their responses. If I ask a model the same question multiple times, I may very well get different responses. Compare this to &#8216;traditional&#8217; software, which is in most cases <em>deterministic</em>, meaning that the same input always results in the same output. This certainty is crucial to building reliable software &#8211; when I hit delete on my 300 unread marketing emails I want precisely those to be deleted, but I definitely want the one from my mum asking about Christmas presents to be left alone. Given this <em>stochasticity</em>, how can we be sure that agentic software is behaving as we expect it to?</p><p>Second, are we to take the claimed performance of these models at face value? In every press release we see a new high score: &#8220;69% on LiveCodeBench&#8221;, &#8220;82% on MMMU&#8221;, &#8220;11% on Humanity&#8217;s Last Exam&#8221;, and so on. There is surely some signal in these metrics, but for me they pose much broader questions. 
There is evidence (<a href="https://arxiv.org/abs/2504.20879">Singh et al, 2025</a>, <a href="https://arxiv.org/abs/2502.06559">Eriksson et al, 2025</a>) that suggests data contamination in benchmarking is widespread, resulting in models overfitting for benchmarks and inflating their scores. It is also not so clear how well these capabilities transfer to &#8216;real life&#8217;, where these language models are implemented in production systems. Does a high score on a particular benchmark mean that a model will perform well in my unique use case? And can I really trust that this model only hallucinates 0.1% of the time?</p><p>In this article I&#8217;ll talk about how we overcome this uncertainty when building reliable AI software, discussing some of the techniques and tools that we use to quantify and control these increasingly intelligent models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PF5Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PF5Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PF5Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PF5Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!PF5Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PF5Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/178081181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PF5Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!PF5Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!PF5Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PF5Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2a0a2a-687f-4bb7-a6e5-b4f5694c54e1_1600x900.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Why not use traditional methods?</h3><p>Evaluating LLMs is difficult because the tasks for which they are commonly used are 
open-ended and qualitative; there may be multiple valid responses and the &#8220;correct&#8221; behaviour often depends on context and user intent. Traditional NLP metrics capture only limited aspects of output quality and often miss nuances of factual accuracy or coherence, while &#8216;classic&#8217; ML approaches don&#8217;t cut it as the ground truth is poorly defined.</p><p>In production environments it is standard MLOps practice to keep track of data as it flows through the system. This means logging the input data and the model&#8217;s predictions, as well as errors, exceptions, and infrastructure health: with these data we can construct a complete picture of the live system. For traditional machine learning models, we can monitor things like data drift and model drift &#8211; metrics with a clear statistical definition. Let&#8217;s say we&#8217;ve deployed a model to predict the price at which a house will sell, based on its postcode, number of bedrooms, square footage, and so on. As the model is used, we can start to understand the distribution of these variables and of the model&#8217;s predictions. Then, if our model starts to regularly predict house prices significantly lower than the average, we can easily see that there has been a change. And, because we&#8217;re monitoring both the input data and the model predictions, we can distinguish whether the distribution of input data has changed or whether the model <em>itself</em> is no longer accurate.</p><p>Unfortunately, both the breadth and nature of applications of language models mean that these changes become harder to detect: with a modern LLM stack we must be able to assess hallucination rates, retrieval accuracy and source adherence, bias and fairness, and instruction following. For agentic systems, there is even more to assess: task completion, reasoning evaluation, and even cost efficiency of actions. 
To keep track of this we need to capture even more data: user prompts, model responses, metadata (e.g. which knowledge base articles were retrieved), and feedback from the user. The picture that we need to build of our models is much more complex, so it is worth highlighting some of the key considerations in brief.</p><h3>Hallucination, retrieval, and instruction following, oh my!</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NHiN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NHiN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 424w, https://substackcdn.com/image/fetch/$s_!NHiN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 848w, https://substackcdn.com/image/fetch/$s_!NHiN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 1272w, https://substackcdn.com/image/fetch/$s_!NHiN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NHiN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp" width="800" height="636" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:636,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32858,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/178081181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NHiN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 424w, https://substackcdn.com/image/fetch/$s_!NHiN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 848w, https://substackcdn.com/image/fetch/$s_!NHiN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 1272w, https://substackcdn.com/image/fetch/$s_!NHiN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F650f4336-ae92-4097-9fc5-de7272db2b12_800x636.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Measured hallucination rates can be shockingly high: one medical literature review benchmark found GPT-4 making up references around 30% of the time, and only 13.4% of GPT-4 citations were found in the underlying corpus [<a href="https://www.jmir.org/2024/1/e53164/PDF">Chelli et al, 2024</a>]. For retrieval-augmented systems, this motivates the use of faithfulness scoring, where the responses are broken down into atomic claims and each tested against the retrieved context using another LLM. Other techniques include &#8220;needle-in-a-haystack&#8221; recall, where we plant a known fact in the input data and evaluate how well the model can retrieve it. Self-consistency checks, like SelfCheckGPT [<a href="https://arxiv.org/abs/2303.08896">Manakul et al, 2023</a>], can also be used. 
In these checks the model&#8217;s answer to the same question is sampled multiple times as a way to determine if the model is effectively guessing. These techniques can be used for so-called selective abstention policies, where the model is given the ability to refuse to answer rather than guess.</p><p>LLMs are trained on vast internet text and can inadvertently learn or amplify societal biases present in that data. Bias and safety are therefore as important to understand as factual accuracy. They can be evaluated with curated suites of prompts designed to elicit opinionated responses, LLM judges that evaluate responses directly, and even red-teaming. The latter is especially important as unsafe behaviour is often conditional on adversarial inputs, rather than normal queries; accordingly, suites such as JailbreakEval compile test queries that should elicit safe refusal [<a href="https://arxiv.org/abs/2406.09321">Ran et al, 2024</a>]. A further difficulty arises in that LLM-as-judge evaluators <em>themselves</em> are biased, with one paper noting a preference for American authors and open-access research papers [<a href="https://www.jmir.org/2024/1/e53164/PDF">Chelli et al, 2024</a>].</p><p>It&#8217;s also crucial that these models reliably do what we tell them to, so we evaluate the <em>instruction following</em> ability of the model. This includes two angles: helpfulness (does it follow the user&#8217;s request and solve their problem?) and compliance (does it obey the constraints and rules that we have defined?). Instruction following is typically evaluated by tracking the outcome of an action; for a model that generates summaries, we can see if the user has approved or rejected the summary. It&#8217;s also important to consider compliance at a more granular level, such as whether the number of returned items matches the number requested. 
In practice, the result is not always so clear-cut, and there are common patterns where models typically fall short; it&#8217;s been observed that models can fail to follow negative constraints or, notoriously, to avoid using em-dashes [<a href="https://www.jmir.org/2024/1/e53164/PDF">Chelli et al, 2024</a>]. Here user feedback can be a direct signal, where response ratings (&#128077;/&#128078;) and the sentiment of follow-up prompts (&#8220;That&#8217;s not what I asked for!&#8221;) give a clear indication of performance.</p><p>Agentic LLMs require further analysis: their success rate on suites of tasks; the correctness of their invocation of tools; their efficiency (number of tool calls and token usage); and, importantly, the safety of external actions. To complicate matters, agents sometimes get the right answer via the wrong reasoning path, which is not robust enough for use in production.</p><p>Ultimately, human review remains the gold standard for assessing the quality of outputs, as can be seen in the ever-present &#8220;ChatGPT can make mistakes. Check important info.&#8221; disclaimers. However, this isn&#8217;t really feasible at scale, especially where the LLM output feeds into another pipeline step, or where it can take actions as part of an MCP server. We therefore need a different approach, one that lets us robustly evaluate the model both in development and at scale. We typically break this evaluation down into two stages: <em>offline evaluation</em> during development, then <em>online monitoring</em> once the model is live.</p><h3>Is my model fit for purpose?</h3><p>Offline evaluation primarily seeks to quantitatively establish the performance of a model when applied to a particular task, such as extracting information from text, summarising documents, or constructing database queries. Typically, this involves creating a dataset of test queries with expected responses. 
This dataset should cover a range of scenarios &#8211; standard interaction queries, edge cases, and adversarial inputs &#8211; to probe the model&#8217;s behaviour. We define what a &#8220;good&#8221; response looks like in each case, sometimes allowing multiple acceptable outputs, and then score the model&#8217;s responses against these ground truths.</p><p>However, this last step is the tricky part as exact matches can be too rigid. Here, techniques from natural language processing (NLP), such as measures of semantic similarity using embeddings, can be useful in some cases. We can even use a separate language model to act as a <em>judge</em> against some predefined criteria, although, as Emeli Dral discussed in our <a href="https://www.youtube.com/watch?v=1wA2YdRifJ4">last MLOps.WTF meetup</a>, these LLM evaluators can introduce their own biases or errors. In practice, the best method to use truly depends on the use case in question; for example, translation tasks are commonly evaluated using the BLEU family of metrics [<a href="https://dl.acm.org/doi/10.3115/1073083.1073135">Papineni et al, 2002</a>]. The good news is that once we have established our evaluation dataset and metrics, these can be re-used whenever we update the model, allowing us to compare model versions directly. Crucially, we can integrate this evaluation into our CI/CD pipelines.</p><p>This approach provides a more rigorous, and importantly more realistic, framework for assessing the capabilities of language models. 
We no longer need to rely on claims and benchmarks, instead we can assess how they perform in (almost) the real world.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-ilf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-ilf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-ilf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-ilf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-ilf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-ilf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg" width="781" height="712" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:781,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:592237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/178081181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-ilf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 424w, https://substackcdn.com/image/fetch/$s_!-ilf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 848w, https://substackcdn.com/image/fetch/$s_!-ilf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!-ilf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7da0ac72-4c61-4874-8791-b2f2dda4fb1d_781x712.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h3>I&#8217;ve deployed a model, now what?</h3><p>With our carefully constructed LLM playground, we can establish whether our chosen model is up to the task that we&#8217;ve assigned it. However, no plan ever survives contact with the enemy, or in this case, real people and their out-of-distribution requests. Production use necessitates accuracy, safety, and compliance under highly variable inputs; without continuous evaluation, a model that has appeared to work well in the lab can still be brittle, wrong, or harmful when used at scale. Therefore, it is just as important to keep an eye on our models in production through <em>online monitoring.</em></p><p>The evaluation of online metrics is largely the same as in the offline case, just with one key difference. 
We still have an input and a response, but in the live setting we lack the ground truth. The solution is not always trivial, requiring a shift from static benchmarking to dynamic, context-aware evaluation.</p><p>Open source tools such as Evidently are becoming central to structuring this ongoing assessment of production models, combining monitoring of traditional metrics with modern techniques like LLM-as-a-judge. This lets us run a wider range of evaluations and build up a clearer picture of the system&#8217;s behaviour. Such evaluations can be embedded in real-time pipelines to flag issues like hallucinations, irrelevance, or misalignment with user instructions. The results can then be visualised in dashboards and monitored for anomalies: a spike in hallucination scores or a drop in satisfaction ratings, for instance, can trigger an alert for careful review.</p><p>Evaluating LLMs in production is certainly challenging, but it is ultimately a clear necessity. These models are powerful yet unpredictable; without proper evaluation of their capabilities and monitoring of what they&#8217;re doing, we cannot trust them in real use cases. But by assessing hallucinations, bias, source adherence, instruction following, and other facets, we gain visibility into the model&#8217;s behaviour and can establish the right safeguards. Thankfully, as these models become more widespread in production systems, so do the tools for keeping an eye on them.</p><p><em>James holds a PhD in astrophysics from the University of Manchester and has since built a career in commercial data science, developing enterprise machine learning software for the manufacturing industry. His focus is on creating practical, scalable tools that help businesses stay ahead of the curve. 
Outside of work, he&#8217;s a passionate climber with a deep love of nature and music.</em></p><div><hr></div><h2><strong>And finally</strong></h2><p><strong>What&#8217;s Coming Up<br></strong>Continuing the theme, our next MLOps.WTF meetup takes place on 18th November at Matillion, where we&#8217;ll dig into how teams evaluate their AI systems in the real world. Expect practical stories on monitoring and evaluating ML and agentic AI: Brad Smith from Spotted Zebra will share how to build reliable evaluation pipelines for GenAI, Daisy Doyle from Awaze will talk through lessons from fraud detection, and Julian Wiffen from Matillion will introduce Maia, their GenAI-powered data engineer.<br><br>&#128467;&#65039; <strong>Tuesday 18th November - Matillion Offices, Manchester &#128071;</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.co.uk/e/mlopswtf-by-fuzzy-labs-meetup-6-18th-november-2025-x-matillion-tickets-1784006454329?aff=oddtdtcreator&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.eventbrite.co.uk/e/mlopswtf-by-fuzzy-labs-meetup-6-18th-november-2025-x-matillion-tickets-1784006454329?aff=oddtdtcreator"><span>Get my ticket</span></a></p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We&#8217;re Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p><em>Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won&#8217;t cut it.</em></p><p><strong>Currently:</strong> We&#8217;re growing (it&#8217;s a very exciting time!) 
and we&#8217;ve got a few roles to fill:</p><ul><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/public-sector-lead-secure-government">Public Sector Lead: Secure Government</a></p></li></ul><p>If you, or someone you know, is looking to work somewhere coffee, sauce and general condiment preferences are regularly discussed and debated&#8230; don&#8217;t hesitate to apply!</p><p><strong>Liked this?</strong> Forward it to someone who loves monitoring and evaluating ML (one of us). Or give us a follow on <a href="https://www.linkedin.com/company/fuzzy-labs">LinkedIn</a> to be part of the wider Fuzzy Labs family.</p><p><strong>Not subscribed yet? </strong>What are you waiting for? The next issue will be our meetup playback - and they&#8217;re always great value.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Make things open: it makes things better]]></title><description><![CDATA[MLOps.WTF Edition #18]]></description><link>https://www.mlops.wtf/p/make-things-open-it-makes-things</link><guid isPermaLink="false">https://www.mlops.wtf/p/make-things-open-it-makes-things</guid><dc:creator><![CDATA[Tom Stockton]]></dc:creator><pubDate>Thu, 23 Oct 2025 08:59:41 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!OKDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,</p><p><em>This episode is brought to you by Tom Stockton, CEO &amp; Co Founder at Fuzzy Labs.</em><br><br>Last month we sponsored the Manchester Tech Festival and were fortunate to hear the Manchester Mayor, Andy Burnham, talk about his vision for digital adoption in Manchester and the wider region.</p><p>He said something really quite powerful:</p><blockquote><p>&#8220;Young people in the boroughs can see the skyscrapers from their bedroom windows, but they don&#8217;t know the pathways to work in them.&#8221;</p></blockquote><p>I couldn&#8217;t get it out of my head and it gave me an idea about how open source could help with this mission&#8230;</p><p>I&#8217;m an open source nerd, and the solutions we build at Fuzzy Labs are rooted in open source. I wrote a <a href="https://www.linkedin.com/posts/tom-stockton-fuzzylabs_im-not-into-politics-but-andy-burnhams-activity-7379409827443191808-BLqf?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAADLPw4BztZEksdpcBZZ44biHn_vEkYAfEo">LinkedIn post</a> about how if our government focused on creating open source foundations for running digital public services, it wouldn&#8217;t <em>just</em> be cheaper and more efficient, it would also be more transparent. It could literally open doors for young people. Showing them how the software that runs our country is built, allowing them to learn from it, and maybe even contribute to it.</p><p>The post clearly struck a chord, so I thought I&#8217;d follow up with something longer. I wanted to look at how this could work in practice, and share some examples of where other countries have made open source a success. But when I dug deeper, I found something surprising. 
The UK is already doing this in some brilliant ways!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OKDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OKDq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OKDq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OKDq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!OKDq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OKDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg" width="1456" height="1091" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1091,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2074498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/176900718?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OKDq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 424w, https://substackcdn.com/image/fetch/$s_!OKDq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 848w, https://substackcdn.com/image/fetch/$s_!OKDq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!OKDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F954fecf3-db8f-4d8e-85c9-00900509fdaa_3319x2486.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><em>(Robbie with Andy Burnham and a bottle of our MLOps Open Sauce)</em></p><h3><strong>What we&#8217;re doing well</strong></h3><p>Have you ever noticed how UK government websites all look and feel the same? That&#8217;s because they&#8217;re built on an open source foundation called the <a href="https://design-system.service.gov.uk/">GOV.UK Design System</a>.</p><p>It&#8217;s a project run by the Government Digital Service (GDS), and it&#8217;s used across hundreds of government services including HMRC, DWP, DEFRA, and the NHS.</p><p>Beyond the <a href="https://github.com/alphagov/govuk-frontend">open source code</a>, it&#8217;s a community with a mailing list, monthly calls, design days and a Slack channel. It has all the ingredients of a great open source project. 
Collaboration, openness, and a sense of shared ownership.</p><p>The result is that every time you visit a government website, you get a consistent and accessible experience. Departments don&#8217;t need to reinvent buttons, forms, or navigation every time they build something new. That has saved millions in duplicated effort, reduced delivery times, and made services easier to use. It&#8217;s one of the quiet success stories of modern government technology.</p><h3><strong>Changing the culture</strong></h3><p>What&#8217;s just as important as the code is the culture that came with it.</p><p>The GOV.UK Design System helped normalise open source inside government. GDS made it clear that openness is a strength, not a risk. Their mantra, <em><a href="https://www.gov.uk/guidance/government-design-principles#make-things-open-it-makes-things-better">&#8220;Make things open: it makes things better&#8221;</a></em>, is more than a tagline.</p><p>It created a new way of working: designing and coding in public, sharing research, and letting others build on your work.</p><p>The Ministry of Justice (MoJ) has taken this mantra to the next level. Their Digital and Technology team is one of the strongest adopters of open source in government. They built their own <strong><a href="https://github.com/ministryofjustice/moj-frontend">MoJ Design System</a></strong>, extending the GOV.UK patterns for justice-specific services like case management and prisoner booking. Even more impressive is <a href="https://github.com/moj-analytical-services/splink/">Splink</a> (great name), an open source Python package for &#8216;probabilistic record linkage&#8217; that is used across government, from health to defence. 
It&#8217;s serious backend engineering solving a real data problem.</p><p>The MoJ team doesn&#8217;t just share code; they share their learnings through their <a href="https://mojdigital.blog.gov.uk/">Justice Digital</a> blog (built using their Design System frontend). I love it! Honestly, if I was looking for a job and wanted to work in Government I would be knocking on their door.</p><h3><strong>Other open source success stories</strong></h3><p>The culture of openness has spread.</p><ul><li><p>The <strong>Home Office</strong> has its own Design System for internal tools and services.</p></li><li><p>The <strong>NHS</strong> created the <strong>NHS.UK Design System</strong>, now used across the whole health service.</p></li><li><p>Local councils are collaborating through <strong>LocalGov Drupal</strong>, which lets them share code and design patterns instead of paying for separate CMS contracts.</p></li></ul><p>There&#8217;s a pattern here. Shared infrastructure that saves time and money while improving quality. We&#8217;ve defined some great examples. Now we just need to push it further.</p><h3><strong>Where it could go next</strong></h3><p>Unsurprisingly, I&#8217;m going to talk about AI.</p><p>If open source made our web services more transparent and trustworthy, imagine what it could do for government AI systems. Models that can be inspected, pipelines that can be reused, and decisions that can be audited, all built on open foundations.<br><br>From my research (and I&#8217;d love to be proven wrong!) this isn&#8217;t happening yet and there aren&#8217;t any initiatives to make it happen. 
The best I could find was a <a href="https://www.gov.uk/government/news/uks-best-ai-engineers-can-apply-now-to-build-tech-for-public-services-in-1-million-fellowship?utm_source=chatgpt.com">&#163;1m fellowship</a> to &#8220;build open-source AI tools for public services,&#8221; but it seems more like a partnership to promote Meta&#8217;s Llama models than a genuinely open source programme.<br><br>This is the opportunity. Take what worked for digital services and apply it to AI. It would make public AI projects faster, safer, and more explainable. Reuse what works, make it open, and let the community improve it.</p><h3><strong>Where open source could make a real difference</strong></h3><p>Take the Department for Work and Pensions&#8217; fraud-detection algorithm.</p><p>An <a href="https://www.theguardian.com/society/2024/dec/06/revealed-bias-found-in-ai-system-used-to-detect-uk-benefits">investigation</a> found that it disproportionately flagged people by age, disability and nationality. As Science and Technology Secretary Peter Kyle said, the public sector &#8220;hasn&#8217;t taken seriously enough the need to be transparent in the way that the government uses algorithms.&#8221;<br><br>In my view, an open source, auditable model training pipeline would have helped massively. Our government obviously wants to deploy AI, but it should also publish how these systems are built, by whom, on what data, and with what fairness checks.<br><br>Transparency is one side of the problem; duplication is another.<br><br>The National Audit Office wrote an <a href="https://www.nao.org.uk/wp-content/uploads/2024/03/use-of-artificial-intelligence-in-government.pdf">excellent report</a> highlighting &#8216;overlap&#8217; in the responsibilities for AI adoption between departments (DSIT and Cabinet Office). That&#8217;s a strong indicator that we&#8217;re duplicating effort at a policy and strategy level, let alone at a technical one! 
<br><br>I&#8217;d like to see a central policy for building open source AI foundations that can be used across departments. The same principles that made <a href="http://gov.uk">GOV.UK</a> a success could do the same for AI. It seems we&#8217;re some way off, but it&#8217;s where we should be heading.</p><h3><strong>What&#8217;s holding us back</strong></h3><p>Despite the great examples from GDS and MoJ, many departments still buy proprietary software to solve the same kinds of problems multiple times.</p><p>Why? Partly culture. Partly procurement. Maybe a lack of confidence or awareness of how to run open projects long term. Maybe it&#8217;s a concern over open source being less secure than closed source (not an opinion I share!). The intention is usually good, to de-risk delivery, but it often leads to silos and wasted spend.</p><p>We should look to change the default and become more comfortable with working in the open. The success of projects from GDS and MoJ shows that it&#8217;s entirely possible to build shared, production-grade software in public.</p><p>At Fuzzy Labs, we see opportunities to open source parts of our own AI projects in government. Because it makes things better. 
Open projects attract contributors, create transparency, and help build technical capability in our country.</p><p>And that&#8217;s really the point.</p><p>When we make things open, we don&#8217;t just make better software.</p><p>We make better teams, better learning, and maybe better pathways for the next generation looking up at those skyscrapers.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2>And Finally&#8230;</h2><p><strong>Want to contribute?<br></strong>If you&#8217;re working in MLOps or building open source AI tools, we&#8217;re always looking for guest writers. Share your production wins, your spectacular failures, or that tool you&#8217;ve built that actually solves a problem. The best insights come from people doing the work &#8211; so if you&#8217;ve got something worth sharing, get in touch.</p><p>While we&#8217;re on the topic, the team have been cooking something up&#8230;</p><p>We&#8217;ve been sketching out &#8220;recipes&#8221; for the MLOps lifecycle &#8211; reusable, step-by-step ways of solving common problems. Think handling new data safely, tracking experiments, serving models in production, setting guardrails, retrieval-augmented generation (RAG), and security.</p><p>It&#8217;s very exciting &#8211; we&#8217;ve even got personalised chef hats for every squad member. But the real question is this: which stage of the MLOps lifecycle do you think most needs a recipe right now? 
(Not rhetorical - answers encouraged.)</p><h2><strong>What&#8217;s Coming Up</strong></h2><p>&#128197; <strong>18th November</strong>: The MLOps.WTF meetup is back at Matillion for edition #6, where we&#8217;ll dig into monitoring and evaluating ML and agentic AI with real-world stories. Brad Smith (Spotted Zebra) will share how to build robust evaluation pipelines for GenAI, Daisy Doyle (Awaze) will talk lessons from fraud detection projects, and Julian Wiffen (Matillion) will introduce Maia, their GenAI-powered data engineer. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.co.uk/e/mlopswtf-by-fuzzy-labs-meetup-6-18th-november-2025-x-matillion-tickets-1784006454329?aff=oddtdtcreator&quot;,&quot;text&quot;:&quot;Get my ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.eventbrite.co.uk/e/mlopswtf-by-fuzzy-labs-meetup-6-18th-november-2025-x-matillion-tickets-1784006454329?aff=oddtdtcreator"><span>Get my ticket</span></a></p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We&#8217;re Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p><em>Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. 
We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won&#8217;t cut it.</em></p><p><strong>Currently:</strong> It&#8217;s an exciting time for us in Manchester, and as always, we&#8217;re calling out for great talent. See our open roles below:<br><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer</a><br><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a><br><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p><p><strong>Liked this?</strong> Forward it to someone who&#8217;s interested in the potential of Open Source and how we can invest in our . Or give us a follow on <a href="https://www.linkedin.com/company/fuzzy-labs">LinkedIn</a> to be part of the wider Fuzzy Labs community.</p><p><strong>Not subscribed yet? </strong>Get great MLOps content straight to your inbox. We&#8217;ll also be sharing some sneak peeks of the recipe book here, so it&#8217;s well worth a sign-up!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[My Type On Paper: The Future of Software Engineering]]></title><description><![CDATA[MLOps.WTF Edition #17]]></description><link>https://www.mlops.wtf/p/my-type-on-paper-the-future-of-software</link><guid isPermaLink="false">https://www.mlops.wtf/p/my-type-on-paper-the-future-of-software</guid><pubDate>Thu, 09 Oct 2025 09:05:22 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!yLg7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ahoy there &#128674;,</p><p><em>This episode is brought to you by Sam Greenwood, MLOps Tech Lead at Fuzzy Labs.</em></p><p>There&#8217;s a lot of hype around vibe coding at the moment. Everyone&#8217;s theorising about what it means for the <a href="https://www.mlops.wtf/p/matt-squire-are-we-the-last-programmers">future of programming</a>, but very few people are truly putting these tools through their paces.</p><p>I&#8217;ve been using AI to assist my coding since GitHub Copilot launched in 2021. I&#8217;ve watched the capabilities grow from line completions that felt like pure magic to fully autonomous coding agents. Given I&#8217;ve been using these tools for quite some time (in the recent timeline of software), I&#8217;m also very aware of their shortcomings and their ability to cause all sorts of problems if used the wrong way.</p><p>So when I wanted to properly test vibe coding, I wasn&#8217;t going to do it on customer projects, but I still needed a sandbox with real stakes. Inspired by our Hackathon and a desire to remove some of the manual admin from hiring, I decided to set myself the task of building an Applicant Tracking System (ATS) for Fuzzy Labs using only AI coding tools.</p><p>Not a single line of code written manually. Love Island on in the background. A true test of whether you can actually build production-ready software entirely through prompting, whilst equally being ready for the inevitable &#8220;I&#8217;ve got a text&#8221;.</p><p>However, my &#8220;experiment&#8221; wasn&#8217;t solely about coding today. 
It was, on a deeper level, about testing what I believe programming will look like in 2030, and what this current trajectory means for the entire SaaS industry.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yLg7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yLg7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yLg7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yLg7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yLg7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yLg7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2336753,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/175627423?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yLg7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yLg7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yLg7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yLg7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cb13126-079e-43e2-bbf6-e46f619ec6d0_3000x2000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>(Artistic representation of what we imagine Sam looked like creating the ATS)</em></p><h2><strong>My prediction</strong></h2><p>I am very confident that the way software is built is going to look very different in the next 5-10 years.</p><p>The capabilities of LLMs have increased drastically in the last 3 years and I believe this is going to keep being the case. Besides the raw capabilities improving, the tooling that leverages them is going to keep getting better. 
Using Cursor, for example, is a much better way of using AI for coding than copying and pasting code into ChatGPT and then copying the output back.</p><p>We&#8217;ll only get more tools like Cursor, and the ones we have will keep on improving.</p><p>As a result of these developments, the speed of building software is going to increase and, as a consequence, the cost is going to fall. What once took a squad of 5 engineers could potentially be done by 1 engineer using AI assistance.</p><p>My mum was a programmer in the 90s and when you compare the way people coded then to now, it is completely different. So when the narrative arises that the way we build software today will stay the same for the next twenty years, it seems counterintuitive. It never has before.</p><p>This shift puts lots of SaaS businesses in a precarious position. It <em>does</em> raise the question &#8220;is SaaS dead?&#8221; when companies and individuals can just create their own software instead of paying for expensive subscriptions.</p><p>There is some truth in the idea that we might start seeing LLMs as an operating system in the future (dynamically creating applications for certain workflows on the fly). I&#8217;m not saying that this is what will happen, just that it has the potential.</p><p>One of the reasons I&#8217;m sceptical of SaaS being dead, at least in the short term, is that even if a company could in theory build their own HR, communication, or sales platform using AI, they would still have to manage it, ensure it stays up to date, address security concerns, bugs, and so on. There&#8217;s an argument AI can do all of that too, but in its current form it still needs an experienced engineer to oversee it.</p><p>But what I <em>do</em> think will happen is that disruptors will come into the SaaS industry and build competing software for a fraction of what it currently costs. They&#8217;ll be leveraging AI assistance heavily and employ a fraction of the number of expensive engineers. 
That means they can offer the same SaaS products for tens of pounds a month rather than hundreds. Even more crucially, they can ship features faster and build a product superior to those of incumbents that aren&#8217;t making effective use of these tools.</p><p>Why try and build software internally when someone else can sell it to you for &#163;40 a month and <em>they</em> can be responsible for new features, security updates, compliance, and so on? It&#8217;s a win-win.</p><p>So that&#8217;s the theory. But I wanted to test where we <em>actually</em> are with vibe coding a SaaS product, how far we are from this near future, and what tools need to be developed to get us there.</p><h2><strong>The experiment</strong></h2><p>I asked ChatGPT Deep Research what the best way of building this full-stack application was. It recommended Railway for deployment. I prompted it to build it out and it worked really well. The frontend came together shockingly fast.</p><p>Then the GDPR problem emerged. An ATS stores personally identifiable information, which is subject to strict GDPR requirements. Railway didn&#8217;t provide enough configurability: there was no option to have data stored in the UK. This could get problematic with GDPR. The trust just wasn&#8217;t there.</p><h2><strong>Where it got a bit muggy</strong></h2><p>The backend was probably the biggest challenge. I moved it to AWS Amplify, but Amazon has released two major versions of Amplify and the models kept getting confused between version one and version two, which meant it kept breaking.</p><p>I spent all evening trying to prompt Claude to fix a deployment error and couldn&#8217;t do it. I knew if I dug into the errors myself I could fix it, but that would defeat the point. The experiment needed to be just prompting.</p><p>GPT-5 came out the next day. Same prompt. 
Fixed first try&#8230; just goes to show.</p><p>I also solved a lot of issues using an MCP server called Context7, which grounds models in actual documentation.</p><p>Since deployment, I can now build new features much more quickly. But I still sometimes hit problems. It could be that AWS isn&#8217;t the right approach, or that there needs to be middleware between the models and AWS, but I digress.</p><h2><strong>Cracking on</strong></h2><p>At regular stages throughout development, I&#8217;d prompt Cursor to do full security reviews, create tickets, then fix things. This caught quite a few issues, so I&#8217;d highly recommend you do the same.</p><p>The ATS is now in testing, with the team using it for actual hiring. The team has also been rigorously testing it and trying to break it from a security perspective - so far it&#8217;s held up.</p><p>I don&#8217;t know exactly how much time I spent on this, but it&#8217;s safe to say it&#8217;s a fraction of the time it would&#8217;ve taken if I&#8217;d built it the <em>traditional</em> way. But I know for certain that at no point was I focused solely on it. I was always doing other things at the same time, which shows you can have it as a background task, whereas in the past you&#8217;d have to be completely focused.</p><h2><strong>Closing off</strong></h2><p>When you see people commenting online about where the future&#8217;s going to be and what the capabilities are, ask whether they&#8217;ve had a go...</p><p>We should be rapidly playing with and testing these tools, figuring out what works and moving ahead, not stuck debating the future of programming. The future is here.</p><p>Another sign of just how quickly this space is evolving is that since I first vibe-coded this ATS, OpenAI have significantly improved Codex. 
It has now actually become my go-to AI coding tool (sorry, Cursor!).</p><p>It&#8217;s an exciting time for SaaS and programming - as we move into a chapter where things can be built far quicker than we imagined 10 years ago.</p><p>Keep an eye out for our new ATS, and if you&#8217;re interested in this way of thinking, why not use the tool to join the team. We&#8217;d love to see you here.</p><div><hr></div><h2><strong>About Sam</strong></h2><p><em>Sam&#8217;s career spans industries from Defence to Logistics, where robustness and reliability are non-negotiable. Passionate about the transformative potential of AI, he thrives on building scalable solutions that make a real-world difference. Sam studied Computer Science at Newcastle University and outside of work can usually be found playing tennis or on the golf course.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading this far! Subscribe for free to receive new posts straight to your inbox. 
Far easier.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>Upcoming Events &amp; Community</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EuMi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EuMi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 424w, https://substackcdn.com/image/fetch/$s_!EuMi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 848w, https://substackcdn.com/image/fetch/$s_!EuMi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!EuMi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EuMi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png" 
width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5379459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/175627423?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EuMi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 424w, https://substackcdn.com/image/fetch/$s_!EuMi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 848w, https://substackcdn.com/image/fetch/$s_!EuMi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!EuMi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc54d2c7c-2ff8-4144-8e7a-f9044ce327e1_4320x2160.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Come to our next MLOps.WTF event! 18th November: MLOps.WTF #6!</strong><br><em>(We know this is Episode #17, and the numbers are not in sync but these are minor details)</em></p><p>MLOps.WTF is officially hitting the road, and our first stop is the Matillion offices at Two, New Bailey St. <br>Same Pizza, new stories and a whole heap of MLOps. 
Make sure you get your ticket as places will be limited.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.com/manage/events/1784006454329/tickets&quot;,&quot;text&quot;:&quot;Get my MLOps.WTF ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.eventbrite.com/manage/events/1784006454329/tickets"><span>Get my MLOps.WTF ticket</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P5Dp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P5Dp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 424w, https://substackcdn.com/image/fetch/$s_!P5Dp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 848w, https://substackcdn.com/image/fetch/$s_!P5Dp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!P5Dp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!P5Dp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg" width="940" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/175627423?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P5Dp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 424w, https://substackcdn.com/image/fetch/$s_!P5Dp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 848w, https://substackcdn.com/image/fetch/$s_!P5Dp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!P5Dp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb853ee3c-ad00-491b-b680-92599e74f916_940x470.jpeg 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><strong>Next Week! 14th October</strong>: We&#8217;re hosting Awaze&#8217;s Women in Tech early career networking evening at DiSH. 
</p><p>Join us for an evening of networking, first-time speakers and, of course, Domino&#8217;s pizza.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.dk/e/awaze-women-in-tech-x-fuzzy-labs-early-career-networking-event-tickets-1616354883969?aff=ebdssbdestsearch&amp;keep_tld=1&quot;,&quot;text&quot;:&quot;Come to the Women in Tech event&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.eventbrite.dk/e/awaze-women-in-tech-x-fuzzy-labs-early-career-networking-event-tickets-1616354883969?aff=ebdssbdestsearch&amp;keep_tld=1"><span>Come to the Women in Tech event</span></a></p><p></p><h2><strong>About Fuzzy Labs</strong></h2><p><em>We&#8217;re Fuzzy Labs, a Manchester-rooted open-source MLOps consultancy founded in 2019.</em></p><p><em>We help organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won&#8217;t cut it.</em></p><p><strong>Currently hiring:</strong> <br>We&#8217;re looking for people to join our Manchester team.*<br><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer<br></a><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer<br></a><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a><br>*<em>Solid engineering skills, passion for open source and coffee encouraged.</em></p><p><strong>Liked this?</strong> Share with your colleagues and loved ones, and receive MLOps.WTF episodes straight to your inbox! 
You can also give us a follow on<a href="https://www.linkedin.com/company/fuzzy-labs"> LinkedIn</a> to be part of the wider Fuzzy community.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/p/mlopswtf-5-newsletter-14?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjozNjYzNjAzODcsInBvc3RfaWQiOjE3MzM0NjY5MSwiaWF0IjoxNzU5OTM1MDI3LCJleHAiOjE3NjI1MjcwMjcsImlzcyI6InB1Yi0yNTY0NDQ4Iiwic3ViIjoicG9zdC1yZWFjdGlvbiJ9.AwTnvPTr00FLcaQum41lXnrHScZww_tsw51js_ejMoM&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.mlops.wtf/p/mlopswtf-5-newsletter-14?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&amp;token=eyJ1c2VyX2lkIjozNjYzNjAzODcsInBvc3RfaWQiOjE3MzM0NjY5MSwiaWF0IjoxNzU5OTM1MDI3LCJleHAiOjE3NjI1MjcwMjcsImlzcyI6InB1Yi0yNTY0NDQ4Iiwic3ViIjoicG9zdC1yZWFjdGlvbiJ9.AwTnvPTr00FLcaQum41lXnrHScZww_tsw51js_ejMoM"><span>Share</span></a></p><p><strong>Not subscribed yet? </strong>You really should.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Are We the Last Programmers? AI and the Future of Code]]></title><description><![CDATA[Watch now | Matt Squire takes the stage at Manchester Tech Festival. 
Recorded live on the 24th September 2025.]]></description><link>https://www.mlops.wtf/p/matt-squire-are-we-the-last-programmers</link><guid isPermaLink="false">https://www.mlops.wtf/p/matt-squire-are-we-the-last-programmers</guid><dc:creator><![CDATA[Matt Squire]]></dc:creator><pubDate>Tue, 30 Sep 2025 14:43:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/174832900/5001d583adbacceed284cac315d9435b.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h1><strong>Talk Overview:</strong></h1><p>This recording captures what might be the biggest shift in software engineering we&#8217;ve seen in decades: the move from hand-typing code ourselves to building it alongside machines.</p><p>Matt Squire traces the thread from 1960s MIT &#8211; where Seymour Papert believed everyone should be able to create with computers &#8211; through to 2025, where &#8220;vibe coding&#8221; is becoming genuinely viable. </p><p>But this isn&#8217;t just a &#8220;look what Claude can do&#8221; showcase. Matt digs into three crucial skills that matter more than ever: problem decomposition and system design, fluency in reading and debugging AI-generated code, and cultivating domain knowledge with proper user empathy. There&#8217;s a compelling argument here that we&#8217;re not facing the death of programming, but rather returning to something closer to its original vision &#8211; where building things matters more than memorising syntax.</p><p>The talk balances optimism with honest questions about what this means for junior engineers learning their craft, and whether curiosity alone is enough to navigate this transition. 
Whether you&#8217;re already knee-deep in AI-assisted development or still sceptical about the whole thing, there&#8217;s genuine substance here beyond the usual hype cycle nonsense.</p><p>Fair warning: contains strong opinions about Emacs, nostalgia for the Amiga 500, and at least one quantum computing joke.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mbMF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mbMF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!mbMF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!mbMF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!mbMF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mbMF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg" width="728" height="546" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1677308,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/174832900?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mbMF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!mbMF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!mbMF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!mbMF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972911a2-e6ef-48c3-9b44-ec01ed64fa0a_4032x3024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div class="image-gallery-embed" 
data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1d59c09-266f-46ad-9169-60ca2c53cf98_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50da3b13-277c-40a8-a25b-e578b43dccec_4032x3024.jpeg&quot;},{&quot;type&quot;:&quot;image/jpeg&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/580414d6-56a4-420c-8d0e-4c75843f2b3a_3024x4032.jpeg&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70b5604d-817b-461c-90d9-967f4a3499f0_1456x474.png&quot;}},&quot;isEditorNode&quot;:true}"></div><h3>Full Transcript</h3><p><strong>Matt Squire:</strong> [00:00:00] Thank you very much, Amul. So, yeah, I&#8217;m following a talk on quantum computing and that&#8217;s quite a big thing to follow. My talk&#8217;s not gonna be nearly as clever, but did you hear about the guy who was having trouble with his quantum computer?</p><p>Phoned up tech support and they said, have you tried turning it off and on at the same time? There we go. So I&#8217;m here to talk about whether we are the last programmers, and I guess the premise of the title here is that we are all programmers. I&#8217;m kind of curious though, can I get a show of hands?</p><p>Who here would regard themselves as being a programmer? Okay. And of those who put your hands up, who uses AI coding tools as part of your work right now? Lots of [00:01:00] people. Now the other question, of course, the logical question, because the other premise of the talk is about, well, those of you who aren&#8217;t programmers or don&#8217;t consider yourselves to be programmers.</p><p>How many of you have been vibe coding lately? 
Okay. A few. Yeah, so that&#8217;s really interesting and that&#8217;s kind of why I wanted to do this talk and why I wanted to bring this topic to the room today. So we are in possibly the most interesting transition in the history of programming, of software engineering as a profession, as an activity, as a cultural phenomenon.</p><p>We are in a position where, for the first time, professional programmers don&#8217;t need to handcraft code a lot of the time. The truth is that we are entering this era where we don&#8217;t need to handcraft code.</p><p>We maybe don&#8217;t need to write code at all. And there&#8217;s a question of not is that true, but of how much code do we actually need to write now, if [00:02:00] any, and what does this mean? As professional programmers, what does this mean for us? And what does it mean for us more broadly, socially? So I wanna answer that question, but I want to tell you a little bit about myself first to set the scene here.</p><p>I&#8217;m Matt. I&#8217;m a programmer. I&#8217;m a computer scientist, more or less. I put a maths enthusiast because I like mathematics, but I don&#8217;t quite have the rigor to call myself a mathematician. Smarter people, like people who talk about quantum computing, probably can do that, but perhaps I can&#8217;t.</p><p>Most importantly, I&#8217;m a nerd and I&#8217;ve been programming for pretty much as long as I can remember. I&#8217;ve also been an Emacs user since 2009. How many people use Emacs? Wow, that makes me sad. Well, come to the Fuzzy Labs stand and you can talk to me about Emacs and I&#8217;ll tell you why you should be programming using Emacs in 2025.</p><p>I&#8217;m also the CTO at Fuzzy Labs, and we&#8217;re basically an AI consultancy. We help people productionise AI. If you like, we are kind of part of the problem. So [00:03:00] that&#8217;s also kind of context for this talk. That photo is me at an event called Oggcamp. 
Oggcamp is an open source conference.</p><p>This is the first one they ran, in 2009. Coincidentally, the year I started using Emacs. And here I am standing on the stairs queuing to get into a talk about Linux kernels or something crazy and inseparable from my laptop where I don&#8217;t know what I was doing, but I&#8217;m sure I was programming something. So that&#8217;s kind of setting the scene here.</p><p>I&#8217;ve been doing this for a very long time. I&#8217;m passionate about it. I&#8217;ve learned a lot along the way, and I say that to kind of put myself in the shoes of what I think at least is a section of programmers in our industry right now who are genuinely worried, worried that AI tooling is gonna take their jobs or change their jobs in some fundamental way.</p><p>I want to at least empathize with that position. I wanna tell you a little bit about my history as well, in terms of programming. So this was my first computer. How many people have ever seen or touched one of these? [00:04:00] Excellent. Good. So this is a Sinclair ZX Spectrum. The 128K there refers to the memory.</p><p>That&#8217;s 128 kilobytes of memory. So this presentation, for context, could not nearly fit on that machine. For the younger people in the audience, you may not know what this thing on the left here is. That&#8217;s called a tape drive. And some of you know about tapes because you&#8217;ve used them.</p><p>Like me, I was probably the last generation who actually used tape. You have programs on that tape, and that&#8217;s how you would load software into this machine. Okay, so that&#8217;s pretty cool.</p><p>And it looks kind of like this. You turn the thing on. This is the menu you get. You get a BASIC interpreter. You get a calculator and you get a loader where you can load programs from the tape or to the tape.</p><p>So far so good. Now, when I got this, it was already out of date, so I was probably 12, 11, 12, that kind of age. 
I just turned 40 this year, so you can do the maths on that. But this was out of date when I got it. It was a [00:05:00] family hand down from someone else. And it came with a manual. And the manual is a beautiful thing.</p><p>I still have it. It&#8217;s a little bit battered, I&#8217;m afraid. I used to carry this round with me on family holidays, genuinely. And it tells you how to program it. It contains all of the information you need to program in BASIC and any of these other things. It tells you how the machine works internally. It may even have some schematics, I don&#8217;t quite remember.</p><p>But the idea is that if you want to write a program and you don&#8217;t know how to do that, well, you can type it in. You can literally type line by line. Your reward in this case is this, so this is a program called Breakout. It&#8217;s a game and you have to knock the little bricks off using your little boat.</p><p>Fair enough. That&#8217;s pretty cool, right? Well, it&#8217;s cool enough for a machine that was built in 1987. Well, soon enough, I managed to get an upgrade on this. So this is my favorite computer ever, even in 2025. This is the Amiga 500. How many people know the Amiga 500? So [00:06:00] this was an enormous upgrade. Interestingly, this machine was built in roughly the same time period as the ZX Spectrum. It was well ahead of its time. And again, we got it second hand. It was a hand down in my family.</p><p>But this became my entire life, to be honest. It was a revelation. It had a floppy disc drive. Isn&#8217;t that exciting? No more tapes. So yeah, I mean this, as I say, it was well ahead of its time, right? So this is what the desktop looks like. You&#8217;ve got multitasking. You&#8217;ve got preemptive multitasking, which was very new at the time.</p><p>You have the ability to load different software from floppy discs. It was really popular with graphics and video editing, believe it or not. 
And it came with a version of Emacs, which is kind of fun. You could program it in BASIC once again, but a little bit more sophisticated: you can do a lot more with it, more power, more options, more tooling.</p><p>And again, it came with a manual. So again, I studied this manual. I got the magazines by the way, the magazines, for people who this is completely alien to, they were magazines and in them you had source code and it was basically like GitHub. But instead you had to go to the newsagent and buy the [00:07:00] magazine.</p><p>And if you wanna do a pull request, it doesn&#8217;t work anyway. So yeah, that&#8217;s how that works. But again, I was, like I said, completely obsessed with learning how to program this thing and learning how all this stuff works. I showed you a screenshot of Breakout earlier. There was a much better version of Breakout called Arkanoid, which you got on the Amiga.</p><p>You got it on the Atari and a few other platforms at the time. This was an amazing game. I spent ages playing this, but I spent even longer trying to program my own version, and you might say, why would you do that when it already exists? But that&#8217;s not the point, right? The point is to learn how this stuff works and it&#8217;s just following your curiosity.</p><p>I spent a very long time writing my own basic version of this, and learning along the way. The code was not good by any stretch of the imagination.</p><p>I don&#8217;t have it anymore, and thank goodness for that. I wouldn&#8217;t put it on my GitHub. I certainly wouldn&#8217;t propose anyone uses it for like, you know, GitHub if they&#8217;re applying for job applications, things like that. But that was that. [00:08:00] And I kind of, you know, asked myself recently, well, if I spent all that time trying to build my own version of this, what would that look like in 2025?</p><p>So I went to my friend Claude. And I said, can you make me an Arkanoid game in Python? 
And believe it or not, two minutes later we have it. So that&#8217;s kind of sobering. You could say, okay, well, do you know what? It&#8217;s actually a really basic game. It&#8217;s pretty easy to write. The code for this is probably in the training set for Claude anyway, so it&#8217;s not that impressive.</p><p>And it doesn&#8217;t tell us anything about the future of programming, right, because it doesn&#8217;t do anything that anybody actually needs, unless what you need is this game. And if you do, well, it already exists. So you could make that argument and say, this doesn&#8217;t really tell us very much.</p><p>So let me show you another example of something that has been vibe coded, completely vibe coded. So in Fuzzy Labs we have a problem. A good problem to have, but one of the problems is that we have a bunch of different marketing [00:09:00] channels and we have different people who follow us on those marketing channels.</p><p>And we want to be able to unify them. We wanna understand if someone&#8217;s following us on LinkedIn, is it the same person who&#8217;s coming to our MLOps meetups? Is it the same person who&#8217;s subscribed to our newsletter? So we couldn&#8217;t find anything that solved this problem in exactly the way we needed to solve it.</p><p>And what happened was our CEO, Tom, stood at the back in the yellow and pink, by the way. So I&#8217;ll just give you a little bit of context. Tom is technical. He can definitely program. He&#8217;s very capable, but he didn&#8217;t want to. He is a very busy man, lots of things to do. So Tom built this entirely using vibe coding.</p><p>Not a single line of code was written, so that&#8217;s interesting, right? This is a real application that solves a real business problem that we have. We have the domain expert who knows the problem inside out, who was able to build an application that he needed, that we needed as a business, without writing any code. [00:10:00]</p><p>And this is in production. This is live. 
We use this every day, every week. And something else, a little bit more lighthearted, Hook-a-Duck. So actually, if you come to our stand a little bit later, you can play this game. This was vibe coded by Rhiannon, our marketing manager.</p><p>And okay, we could have got something off the shelf, but we wouldn&#8217;t have been able to customise it and make it look Fuzzy-ish and these kinds of things. So again, relatively straightforward game, but someone is able to just build that by prompting an LLM without really needing to write any code.</p><p>And this is what I mean, right? This is the key question, right? And I framed this in a negative way deliberately. And the reason for that is that I wanna reflect what I see as a bit of a negative mindset in the industry at the moment.</p><p>At least if I go by what people post on LinkedIn, which maybe I shouldn&#8217;t, but you know, it&#8217;s that kind of, well, if anyone can build software, it&#8217;s not special anymore, is it? So why do it? [00:11:00] Why bother? Is programming dead at this point? That&#8217;s the question I feel a lot of people are asking, and I don&#8217;t feel like it&#8217;s the right question.</p><p>I feel like it&#8217;s unnecessarily negative, but nevertheless, that is a bit of the sentiment. What I&#8217;d like to argue over the next 20-ish minutes is that actually we are in possibly the most exciting period of time in software engineering that there&#8217;s been, but also that this idea that anybody can build software is not a new idea in the history of computer science.</p><p>And it is not something that we should be fearful of or negative about. If anything, it&#8217;s a good thing that we should celebrate and embrace. So with that in mind, I would like to take you on a history lesson. In fact, most of this talk is a history lesson, you&#8217;ve probably noticed. 
So I&#8217;m gonna quiz people on their knowledge of computer science history.</p><p>I dunno how well this is gonna go, but we&#8217;ll give it a try. Does anybody by any chance know who the man on the left is? No. Okay. [00:12:00] Seymour Papert was a South African-born mathematician. He had not one, but two PhDs in mathematics. Considerably smarter, I would venture, than all of us in this room.</p><p>But thankfully we get to benefit from his mind and his brain. He became co-director of the MIT AI lab in 1967 when not only was computing relatively new, but AI was definitely new. Incidentally, the history of AI goes all the way back to the early 1950s with the work of people like Alan Turing.</p><p>It&#8217;s not as new as we think, but anyway, what Seymour Papert was really passionate about was education. And particularly he was interested in understanding how this, for him, new thing of programming could be used to transform education in the classroom. How could it be used to change how we teach children?</p><p>So he had a theory called constructionist learning. Actually, it wasn&#8217;t entirely his theory, it&#8217;s based on the work of the psychologist Piaget. [00:13:00] But fundamentally, he&#8217;s asking the question, how do children use technology to learn? And he had a few principles here. So he&#8217;s observed that learning happens best when children are able to make things and share them with their colleagues.</p><p>Programming for Papert is the thing that allows those children to very easily build things. So that&#8217;s kind of interesting. It&#8217;s empowering. Programming is enabling people to create things. It&#8217;s not seen as an obscure skill that you need to spend years and years of your life building up and then hoarding.</p><p>It&#8217;s actually something for everybody, and that&#8217;s gonna be a common theme here. Debugging he sees as a way of teaching, basically, intellectual rigor. 
It&#8217;s about teaching children how to think about thinking. Frankly, it teaches anyone how to think about thinking. Certainly not just children. The overall concept here is that people learn best when they can build things that they personally care about.</p><p>Well, that kind of resonates with me if I think back to my early [00:14:00] history with programming, of I just wanted to build Arkanoid. I wanted to build my own version &#8217;cause I just cared about it and I can&#8217;t say why. I just wanted to do it. So it&#8217;s that kind of idea, that kind of thinking. If you really want to delve into the history here, there&#8217;s a book which, uh, Papert wrote called Mindstorms.</p><p>By the way, the Lego thing is named after that. So this is, some of you have probably played with something like this at some point during school. It&#8217;s a bit inconsistent in terms of how this is delivered in the curriculum, but this is called Turtle. It&#8217;s a robot that can draw patterns. So the idea is it can move back and forth.</p><p>It can turn around, it has a pen that can go down and up and you can program it to draw pictures and patterns and things like that. This is a very early version, but even nowadays in British schools, at least sometimes, variations of this are being used to teach programming at a very early age. And I&#8217;d be interested afterwards actually, if anyone has any anecdotes about that.</p><p>I&#8217;m interested about the experiences there. So on the left we have Cynthia Solomon, actually just, just LA left there and Seymour Papert again on the right with his robot and his little fish that he&#8217;s clearly very [00:15:00] proud of. So Cynthia Solomon and Seymour Papert worked together. They collaborated heavily on a programming language called Logo.</p><p>Logo is the language that you use to program the robot. Logo is designed to teach children how to program. It could be used to teach anyone how to program. 
It fundamentally exists to teach the basic concepts of programming, the basic reasoning and logic, to anybody who really wants to learn. And here we have an example, so very, very old picture, very grainy, but we have, um, the little robot being used and the little teddy bear there.</p><p>How does it work? Well, I don&#8217;t really have time to go into the deep details of Logo in terms of the specifics, but the idea is it&#8217;s very, very simple. So we have very simple commands, very simple verbs that tell you how to tell the robot what to do.</p><p>For instance, if we wanna draw a square, how you read this is you say, in order to draw a square, you repeat four times, you move forward a hundred steps, rotate 90 degrees, and then you finish. Right? Fair enough. Seems straightforward enough. I&#8217;m not suggesting, by the way, that we should all [00:16:00] switch to programming Logo.</p><p>I&#8217;m using this as an illustration for a way of thinking about programming that exists in the very early history of computer science that I think we&#8217;ve since lost. We&#8217;ve lost touch with this idea that everybody should be able to, not necessarily program, but should be able to create things using computers, and if programming is the means of doing that or if something else is the means of doing that, doesn&#8217;t particularly matter.</p><p>The point is that we can use computers to create things and learn while we do so. And I love this quote from Papert. So he says, a programming language is like a natural human language in that it favors certain metaphors, images, and ways of thinking. So that&#8217;s how he&#8217;s thinking about programming at that time in the 1960s.</p><p>So he sees programming as being for everybody. The language should support human creativity. 
And he even criticised languages like BASIC and Fortran and things like that at the time because he didn&#8217;t feel that they had that same idea of being able [00:17:00] to, uh, give everybody that creative capacity that he felt that there should be.</p><p>Now, uh, the other person here is Alan Kay. I should have made that a quiz, but there&#8217;ll be a few more quizzes coming, don&#8217;t worry. Um, Alan Kay, just really briefly, um, kind of built on a lot of the ideas of Papert, but he invented a language called Smalltalk. How many people have ever programmed or seen Smalltalk?</p><p>I&#8217;ve never had the pleasure of programming Smalltalk myself. It&#8217;s one of the first object-oriented programming languages. Not the first, but one of the first. And it developed a lot of the early concepts. Alan Kay shared the same philosophy that really programming should be a way for people to express thought and express their creativity and build things.</p><p>And that was the philosophy he built into Smalltalk. Again, I&#8217;m not suggesting we should all go and program Smalltalk, but it&#8217;s just interesting to think about the history of this and how the ideas developed, but then how we&#8217;ve kind of moved away from these ideas. And I think if you look at some software ecosystems now, particularly I look at, I dunno, the JavaScript ecosystem, even the Python ecosystem that I spend a lot of time in, the [00:18:00] complexity now is huge.</p><p>The amount of stuff you need to know to build software successfully is enormous. Whereas this idea of, well, anyone should just be able to create things, that seems to be a little bit lost. And you can see the kind of joined-up thinking between Alan and Seymour here. 
So he said he visited Seymour Papert and he observed children doing real programming with a specially designed language and environment, and that encounter hit him with what the destiny of personal computing was really going to be.</p><p>It seemed for a long time like that destiny had not been realised. I would argue that we are about to realise that destiny with the advent of AI-assisted programming. And so I&#8217;ve got roughly 12 minutes. I&#8217;d like to answer the question of how we adapt, or at least I&#8217;d like to begin to answer that question.</p><p>I&#8217;d like to try to give something about how we should adapt to this new reality, because this conversation is going to continue. This is definitely not the end. It is really [00:19:00] just the beginning of that conversation, but I&#8217;ve got three thoughts here. The first is to focus on problem decomposition and system design.</p><p>The fact of the matter is that building software, like actually building software in production at scale, making it secure, making it satisfy all of the edge cases and different user requirements that come up, is a lot more than just programming, is a lot more than just writing code. So if the typing code bit can be automated, there&#8217;s a lot more space for us to focus on what&#8217;s really, really hard, which is:</p><p>How do I take a complex problem and break it down into simple parts? How do I do that? A lot of software development is that, really. How do I interpret requirements? We are never gonna get away from that. And actually, if anything, what we&#8217;re gonna find is that, there&#8217;s a bit of a joke I guess in the software development community about the stereotypical project manager who isn&#8217;t able to express requirements and the programmer who&#8217;s [00:20:00] stereotypically pedantic about the requirements. 
What we&#8217;re gonna find, I think, is we&#8217;re gonna end up meeting in the middle because we&#8217;re actually still gonna need to be specific about our requirements.</p><p>And anyone who&#8217;s tried vibe coding themselves will have realised this when they&#8217;re prompting: if you are not really, really specific about what you want, then Claude is gonna do something and it may or may not be what you originally intended, but it&#8217;s gonna be something. So it&#8217;s almost speeding up that feedback loop.</p><p>Now in the picture here, anyone want to venture a guess? The lady in the middle?</p><p>Grace Hopper? Yes indeed. Grace Hopper was, um, fantastic at abstraction. So Grace Hopper, among other things, invented the COBOL programming language. COBOL was designed to be a kind of natural language way of programming. It was designed to allow business specialists to write software without knowing things about, I don&#8217;t know, pointers and memory allocation and that sort of thing.</p><p>She invented one of the first compilers in order to make COBOL work. So she&#8217;s really good at thinking about abstraction, thinking about taking a complex [00:21:00] problem and breaking it down into sub-problems and layers of abstraction where different levels of complexity exist at different levels.</p><p>That&#8217;s really important. That allows us to turn what was, at the time, writing assembly code or Fortran into being able to write what almost looks like English and have it execute. Also, she&#8217;s famous for talking about lengths of wire. If anyone&#8217;s not seen it, just search YouTube later for, um, Grace Hopper, lengths of wire. Just trust me on this. So she explains how, um, the latencies of information transmission work with a very intuitive example, again, demonstrating how good she was at abstraction. 
Okay, well, the next thought is building fluency in reading, debugging, and evaluating code.</p><p>Because if a lot of the code we use is going to be written by an AI, you know, either we&#8217;re going for vibe coding and we&#8217;re just telling it by a prompt to generate us some code, or we are maybe not quite going that [00:22:00] extreme, but we are, um, using an assistant to help us write the code as we go.</p><p>Either way, we need to be reading a lot of code that&#8217;s generated, not just by our colleagues, but by AI. We need to be able to interpret it. We need to be able to debug it, we need to be able to evaluate it. Now, debugging is particularly interesting. Um, I think it was, um, I&#8217;m gonna say Brian Ingham, might have got that wrong, but I think it&#8217;s Brian Kernighan and one of the inventors of C said that debugging code is always twice as difficult as writing it. So you should never write the cleverest code you can possibly write, because you&#8217;re not clever enough to debug it. That logic applies here, right? We, we kind of, we need to be good at debugging stuff.</p><p>We need to be good at interpreting what the AI generates. If we don&#8217;t understand what the AI generates, we&#8217;ve got no hope, let&#8217;s be honest. Um, this picture is Edsger Dijkstra. Edsger famously didn&#8217;t like computers. He was a computer scientist who didn&#8217;t really like to use computers. He said that computer science has as much to do with computers as microscopes have to do with biology, or telescopes have to do with astronomy, but you know, he also felt that debugging was a particularly [00:23:00] important skill to develop. And the last one is about cultivating domain knowledge and building user empathy. 
And this is a mixture of technical skills, but also business knowledge, expertise in a particular domain and our ability to just understand other people, and particularly the people who are going to use the software that we build.</p><p>So again, there&#8217;s a lot more room for this if we are no longer concerned with typing code into a machine. There&#8217;s a lot more room to think about this. There&#8217;s a lot more room for using AI to ideate about these things too. But the idea here is really understanding the problem that you&#8217;re trying to solve through the software that you are building.</p><p>Really understanding those requirements, really understanding how the users are going to interact with and experience the software that we&#8217;re building. That&#8217;s crucial. And you know, the picture here happens to be the [00:24:00] very first spreadsheet, believe it or not, which was VisiCalc. So, you know, it is interesting to think about how something as simple as the invention of the spreadsheet transformed how people used computers to solve certain problems.</p><p>You can&#8217;t imagine using computers without the existence of spreadsheets right now, right? As soon as you see a spreadsheet-shaped problem, you immediately recognise it and jump into one. But that had to be invented. Someone had to think that this was a particular user experience that needed to exist and they had to build it.</p><p>And VisiCalc is long dead, but the concept is still there. It remains, it survives. So that&#8217;s three thoughts on things we can think about to change how we, what we emphasize and what we focus on as software engineers, as professionals in this new world. So I think, as I said at the beginning, right now is the best time to be a programmer.</p><p>And there&#8217;s a few reasons for this. Firstly, that we are entering an [00:25:00] age where, in principle, anybody can build software. Anybody can create software. 
And sure, granted, it may not be the case that the non-technical person can create and productionise something, at least not yet, and probably not for a while.</p><p>But someone can express an idea through software in a way that they could never hope to do before, and that&#8217;s really exciting. If anything, it probably means more work for us software engineers because we have lots of people who have ideas that they can now prototype that will gain funding and need to go to production.</p><p>That&#8217;s good news, but also because the feedback loop is so much faster, the tools we have allow us to build things rapidly, iterate on them and gather feedback, gather, um, like a real, real-world experience of what that software&#8217;s gonna do, of what a particular thing is gonna look like in a way that we just couldn&#8217;t before.</p><p>So it&#8217;s that kind of [00:26:00] the speed of experimentation, I suppose, is what I&#8217;m getting to, right? It&#8217;s the speed of experimentation that we are gaining now that we never had before, and the ability to collaborate with a wider set of stakeholders. You know, if I can build something and give someone the ability to sketch out a variation on what I&#8217;ve built, even if they&#8217;re not particularly technical, and then I can take that and build on that and we can go back and forth.</p><p>That&#8217;s exciting. But the final point is to that point about education, because that&#8217;s one of the concerns that, when I&#8217;ve spoken to people, has been expressed the most strongly. Particularly when I&#8217;ve spoken to engineers early in their career, they&#8217;ve said, what will the impact of AI-assisted coding be on my ability as a junior engineer to learn and hone and improve my craft?</p><p>That&#8217;s a hard question to answer because it almost looks like, really cynically, maybe you can&#8217;t. 
Right, and there&#8217;s almost these cynical takes you see on LinkedIn of we&#8217;re [00:27:00] only gonna have senior engineers, and at some point that&#8217;s gonna be a problem because none of the junior engineers will have gone through the learning journey they needed to go through to be a senior.</p><p>I don&#8217;t think it&#8217;s necessarily that gloomy though. The reason is that I think back to myself again, learning to program. I think that if I were that, again, 11, 12-year-old boy stood in front of probably not an Amiga 500 in 2025, but stood in front of something else,</p><p>and I had access to these tools, would I still try to build Arkanoid? Well, I think I would, and I don&#8217;t think it would be a question any more than it was ever a question of, well, Arkanoid already exists. I have it on floppy disc. Why would I write my own? There&#8217;s never, never a question. And the reason is curiosity.</p><p>So the most successful engineers that I&#8217;ve ever seen in my career have something in common. It&#8217;s curiosity. It&#8217;s a need to learn. It&#8217;s a need to know. I couldn&#8217;t have [00:28:00] not built my own Arkanoid and it doesn&#8217;t matter that it was bad code &#8217;cause I was 12, for goodness sake. It doesn&#8217;t matter. It&#8217;s about the learning, it&#8217;s about the curiosity, it&#8217;s about following that curiosity.</p><p>I would definitely, I think, use the tools to help me learn more quickly. I could go to ChatGPT and say, here&#8217;s my basic interpretation of Arkanoid, what do you think? And maybe ChatGPT could have given me some helpful hints, and maybe I would&#8217;ve been an even better programmer at the start of my career as a result, because I&#8217;ve learned from this massive wealth of knowledge that&#8217;s available to me.</p><p>And so just to conclude, one little ironic twist of history. Um, I mentioned that Seymour Papert was in the, uh, MIT AI lab, 1967. 
The other co-founder was a man named Marvin... the name has dropped outta my head. You freeze when you&#8217;re on stage.</p><p>Um, Marvin Minsky, right. So Marvin Minsky, one of the things Marvin Minsky&#8217;s famous for is, um, his work on the perceptron, um, which was one of the very early manifestations [00:29:00] of neural networks, or rather neurons, artificial neurons. Uh, he wrote a book called Perceptrons, in which he demonstrated a number of fundamental limitations of the neural networks at the time.</p><p>And that book led to a massive drop-off in funding for AI research. It started what was called the first AI winter. So, you know, the kind of, um, you start out along this story arc, you get to a collaboration that paused at least AI research for a good few years before it picked up again. In any case, I, um, would like to say thank you for listening.</p><p>Um, I think we&#8217;ve got a little bit of time for questions as well, but also you can come and meet me and the Fuzzy Labs team back in the vendor area there. We&#8217;ve got a table. You can play vibe-coded Hook-a-Duck. See if you can break it. That might be fun. But also you can win prizes, so maybe just do that.</p><p>You can learn more about us at our website. We also have a newsletter. And I can&#8217;t not say we are hiring because we do believe that there is still a [00:30:00] need for programmers in 2025. So if you wanna talk to me about that as well, let me know. 
</p><p>Thank you.</p>]]></content:encoded></item><item><title><![CDATA[Getting Your Business Ready for AI: Smart Homes Need Smart Foundations]]></title><description><![CDATA[MLOps.WTF Edition #15]]></description><link>https://www.mlops.wtf/p/getting-your-business-ready-for-ai</link><guid isPermaLink="false">https://www.mlops.wtf/p/getting-your-business-ready-for-ai</guid><dc:creator><![CDATA[Danny Wood]]></dc:creator><pubDate>Thu, 25 Sep 2025 08:01:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Egzo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This week's episode is brought to you by Danny Wood, Lead AI Research Scientist at Fuzzy Labs.</em></p><p>Ahoy there &#128674;,</p><p>Go to LinkedIn and it seems like every business is solving their problems with AI. The enthusiasm is infectious - and honestly, it's exciting to be part of this field. But what does it truly mean to "embrace AI"? Are companies really embracing it, giving AI a proper bear hug, or are we only seeing part of the picture?</p><p>Think about renovating a house. You might be buzzing with excitement about installing smart home technology - automated lighting, AI-powered security, voice-controlled heating. 
But if your home isn&#8217;t set up for those systems or you don&#8217;t even have power going to every room&#8230; then the technology won't just fail to work properly - it'll create more problems than it solves.</p><p>The same principle applies to AI projects.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Egzo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Egzo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Egzo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Egzo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Egzo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Egzo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg" width="735" height="583" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:583,&quot;width&quot;:735,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/174350792?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Egzo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Egzo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Egzo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Egzo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28566b0c-1d2e-4055-aabd-5abe07350117_735x583.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2><strong>Setting up for success</strong></h2><p>Let's say that you're a software engineer in a relatively small business and your CEO has decided that now is the time to "adopt AI." Sure, your salespeople have been using ChatGPT to help get the tone right on follow-up emails, and you've been using Cursor for some of your more tedious refactoring, but now it's time for the business to get systematic with how it's using machine learning.</p><p>Whether they want to start using a simple machine learning model for forecasting or they want a fully-agentic chatbot that is able to deal with customer queries, you've been tasked with making sure that when the data scientists and machine learning engineers come in, they have the best chance of making a big impact.</p><p>To get this right, you need to get things up and running quickly. 
Maybe you're hiring some consultants for a few weeks, or you're setting up an entirely new team. You know that enthusiasm will wane if your managers don't see results soon. So how do you set up for success?</p><p>There are a few things that you can do to make sure that money spent investing in AI is money well spent.</p><p>The companies that get this right share something in common: they were getting their data house in order long before they started shopping for AI solutions. They've got their tools ready, paint colours picked out, a new floorplan fully mapped - rather than winging it and hoping for the best.</p><h2><strong>What problem are you trying to solve?</strong></h2><p>Sometimes, there is a very specific problem that AI is being brought in to solve or a particular process that AI is going to be used to automate. Other times, the brief might be as vague as "We need to start using AI or we'll fall behind." The further you can move towards having concrete problem statements, user stories or use cases, the more likely your project is to be a success.</p><p>You need to be specific about your goals before you start.</p><p>If you're starting out with something as vague as "apply AI to our manufacturing process," ask: can you hone that down to "we want to be able to predict our output over the next four weeks" or, even better, something super specific like "We know that the data from this component changes in a characteristic way in the days before it fails, and we'd like to automatically flag that"?</p><p>Just like you can't build an extension if you haven't got the land for it, you can't build effective AI systems without knowing exactly what problem you're solving.</p><p>Early on, there's huge benefit in finding simple but concrete challenges that can be claimed as early wins.</p><h2><strong>Getting your data in order</strong></h2><p>If you want data scientists or machine learning engineers to build something quickly, give them enough data 
straight away! You might not know what data they will need, but it's a safe bet that they will need as much data as you can give them and will constantly ask for more.</p><p>There will be times when you need specialist AI engineers or data scientists to tell you exactly what data they need, but in general, the best time to start collecting data is now! For a forecasting model, how much data on past trends can you collate? For an LLM chatbot with RAG, do you have an archive of the kinds of documents it's going to be reading? For a computer vision model, are you already collecting images and are they categorised/labelled?</p><p>Take an appliance company we worked with. They were trying to predict when certain components were going to fail in their systems. They had months and months of sensor data from all of their refrigeration systems, which was a great starting point.</p><p>But the real breakthrough came when they could show us labelled examples - specific points where components had failed, connected to what the data looked like beforehand. That's like having a proper inventory system that tells you not just "we have pipes" but "these are the pipes that work well, these are the ones that failed, and here's why they failed."</p><p>If you just have logs from all your systems but no context about when things went wrong, you can't build predictive models. You need those examples that show what the data looked like when components were about to fail. Just like you need to know the shape of the room before you start fitting the carpet.</p><h2><strong>Subject matter experts</strong></h2><p>Sometimes, making accurate predictions is just a case of pointing the right model at the data and hitting train... but this is rare. And even then, sometimes the hard part is knowing what the useful thing to predict even is. 
More often than not, finding the right subject matter experts in your company is going to be key to success.</p><p>Talking to subject matter experts is also vital in working out whether what is being asked for is possible. If your subject matter expert can't do a task, do you have good reason to believe that an AI is going to fare much better?</p><p>These people aren't just helpful - they're often the difference between a system that works and months of experimentation.</p><h2><strong>Setting expectations</strong></h2><p>Make sure that the people who are pushing for AI know what to expect. It's not going to surpass human ability, and it's not going to replace whole teams or years of accumulated knowledge.</p><p>The strength of AI is in automating and scaling tasks that are in that sweet spot where they're too nuanced or complex to write code to perform automatically but still simple enough that a person would probably find them quite tedious. This is true even with the latest generative models: for messy real-world problems, you may find that AI will fail in weird and surprising ways if the problem space is not carefully constrained.</p><p>When companies set these realistic expectations upfront, their AI projects deliver results that genuinely make everyone's work more productive.</p><h2><strong>The two kinds of AI project</strong></h2><p>Actually, it's worth understanding that when we talk about AI projects, we're really talking about two different paradigms:</p><p><strong>The LLM route</strong>: Building chatbots, document Q&amp;A systems, automated writing tools. These need your company's knowledge base - all those FAQs, manuals, conversation histories that you've been accumulating.</p><p><strong>The traditional ML route</strong>: Forecasting, anomaly detection, computer vision, recommendation systems. 
These need historical data with clear examples of what you're trying to predict or detect.</p><p>Both paths can lead to success, but they need different types of preparation. The companies that do well figure out which path they're on early and prepare accordingly.</p><h2><strong>The multiplier effect</strong></h2><p>What gets me excited about this: when companies get their data in order and approach AI thoughtfully, they don't just improve their own results. They create a multiplier effect that benefits everyone.</p><p>If every company started thinking about data collection and labelling now, it would be so much more efficient for everyone. The data would be there when needed. We could train models on the right information from day one. Projects would deliver results faster and more reliably.</p><p>Better preparation leads to more successful projects. More successful projects lead to more realistic expectations about what AI can actually do. And all of that creates a virtuous cycle where the entire ecosystem gets better at building genuinely useful AI systems.</p><p>Every company that methodically prepares their data and involves their domain experts makes it easier for the next company to understand what success looks like.</p><h2><strong>The opportunity right now</strong></h2><p>While everyone else is rushing to install the latest AI applications as quickly as possible, taking time to get your data house in order isn't just smart preparation - it's a competitive advantage.</p><p>The companies that invest in proper data preparation, involve their domain experts, and set clear expectations are building AI systems that actually deliver on their promise. They're the ones whose employees are genuinely excited about working with AI tools because those tools actually make their work better.</p><p>Your future self will thank you for doing this groundwork properly. 
More importantly, you'll be building systems that create real value rather than impressive demos that struggle in the real world.</p><p>And honestly? In a landscape where everyone's talking about AI but fewer are delivering consistent results, being one of the companies that gets the foundations right is a real opportunity.</p><p></p><p><em>Danny Wood is Lead AI Research Scientist at Fuzzy Labs, where he helps companies turn AI enthusiasm into systems that genuinely work. He believes the best AI projects start long before anyone mentions machine learning.</em></p><div><hr></div><h2><strong>And finally...</strong></h2><h3><strong>Upcoming Events &amp; Community Updates</strong></h3><p></p><p><strong>PyData Manchester - Tonight! 25th Sept, 6:30pm @ AutoTrader</strong> our very own Shubham Gandhi tackles "How to Train Your LLMs?" with practical optimisation techniques that make LLM training more resource-efficient and scalable<a href="https://www.meetup.com/pydata-manchester/events/310734327/"> PyDataMCR September , Thu, Sep 25, 2025, 6:30 PM | Meetup</a>. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.meetup.com/pydata-manchester/events/310734327/&quot;,&quot;text&quot;:&quot;Join Waitlist&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.meetup.com/pydata-manchester/events/310734327/"><span>Join Waitlist</span></a></p><p><strong>Awaze Women in Tech Early Career Networking - 14th Oct @ Fuzzy Labs</strong> "Education to Industry: Discussing the Transition" </p><p>Join Awaze's Women in Tech team for a networking evening offering female-identifying people a platform when entering tech and data careers. With two Fuzzicans as part of the event (Tiff and Savannah) we&#8217;re excited to host such an empowering initiative. 
All are welcome to attend.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.co.uk/e/awaze-women-in-tech-x-fuzzy-labs-early-career-networking-event-tickets-1616354883969&quot;,&quot;text&quot;:&quot;Get Your Ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.eventbrite.co.uk/e/awaze-women-in-tech-x-fuzzy-labs-early-career-networking-event-tickets-1616354883969"><span>Get Your Ticket</span></a></p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p>We're Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.</p><p>We help organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design systems that create real value in the real world.</p><div><hr></div><p><strong>Currently hiring:</strong></p><ul><li><p><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer</a></p></li><li><p><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p></li></ul><p>Want to chat? We love talking about interesting ML challenges - we also love all dogs, sauce, ducks, coffee and helping propel the MLOps community onward and upwards.</p><p>Liked this? Forward it to someone who's thinking about taking their AI projects to the next level and wants to set themselves up for success. 
Or give us a follow on <a href="https://www.linkedin.com/company/fuzzy-labs">LinkedIn</a> for more MLOps thoughts and insights.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.linkedin.com/company/13022643/&quot;,&quot;text&quot;:&quot;Follow us on LinkedIn&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.linkedin.com/company/13022643/"><span>Follow us on LinkedIn</span></a></p><p>Not subscribed yet? The next issue is going to dive into our Applicant Tracking System - an experiment in vibe coding - and where we think our rapidly evolving industry might end up.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/p/getting-your-business-ready-for-ai/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/p/getting-your-business-ready-for-ai/comments"><span>Leave a comment</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Testing, reliability, and why Bob the Builder can't trust his AI mates]]></title><description><![CDATA[MLOps.WTF Meeting #5. 
Edition #14]]></description><link>https://www.mlops.wtf/p/mlopswtf-5-newsletter-14</link><guid isPermaLink="false">https://www.mlops.wtf/p/mlopswtf-5-newsletter-14</guid><dc:creator><![CDATA[Rhiannon]]></dc:creator><pubDate>Thu, 11 Sep 2025 12:26:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wXeQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31f01532-bc68-4514-a19f-356fe9d8433d_4032x3024.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A humble rundown of MLOps.WTF meetup #5, where pizza was consumed, fire exits were located, and Matt's t-shirt preferences became unexpectedly relevant to the evening's discourse.</em></p><p>Another brilliant turnout for <a href="https://www.mlops.wtf/">MLOps.WTF</a> #5 in Manchester, as we continue to build a community of people who love productionising AI in all its forms. Our fifth meetup focused on agents 2.0, and between Matt's mandatory fire safety briefing (fire exits noted, no fires planned) and his unprompted declaration that his favourite t-shirt is the "small, slim fit from TopMan" which he can't buy anymore, we managed to squeeze in three brilliant speakers.</p><p>The theme? How to cut through the agentic hype and actually make AI work in production. The room fell quiet as our speakers kicked off, and we were once more dazzled by the knowledge and passion of the Manchester MLOps community. 
Settle in as we cover how it all went down...&#128071;</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/heic&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/31f01532-bc68-4514-a19f-356fe9d8433d_4032x3024.heic&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce0a0cd0-97cd-4024-a63c-f0c22ca2b152_2270x1734.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/115c9024-ce33-4ceb-b500-e1f5e1159d6e_1916x1590.png&quot;}],&quot;caption&quot;:&quot;&quot;,&quot;alt&quot;:&quot;&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e18c27c-553e-4508-8919-6eb5ff9fe675_1456x474.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><h2><strong>Emeli Dral: "How to Evaluate and Test AI Agents"</strong></h2><p>Emeli, co-founder and CTO of <a href="https://www.evidentlyai.com/">Evidently AI</a>, opened by defining what we mean by AI agents: anything with rules, tools, and the ability to plan and act.</p><h3><strong>Why AI agent testing is genuinely hard</strong></h3><p>Traditional AI, like demand forecasting, was much easier - you just compare outputs to actual results. But with unstructured data like pictures or text, Emeli explained, we cannot really compare pictures by pixels... 
it doesn't make any sense.</p><p>Agents present new challenges: free-form inputs and outputs, complex usage scenarios, many ways to be "correct," and risks of cascading errors through multi-step workflows.</p><p>Her solution borrows from software engineering but adapts it for AI's peculiarities. We need ways to automate evaluations for open-ended text data using task-specific quality criteria.</p><p></p><h3><strong>The testing hierarchy adapted for AI</strong></h3><p>Unit tests catch bugs early and cheaply. For AI agents, this means asking:</p><ul><li><p><strong>Prompt validation</strong>: Does your prompt contain required fields like "task description" and "constraints"?</p></li><li><p><strong>Tool I/O validation</strong>: Are input/output types correct? Does your calculate_growth tool handle division by zero properly?</p></li><li><p><strong>LLM-as-a-tool validation</strong>: For agents that use other AI components</p></li></ul><p>Emeli showed practical Python examples, demonstrating how to validate outputs against a JSON schema and test tool execution success (keep an eye out for our video recordings of the event to see this for yourself too!).</p><p>Integration tests check component interactions, asking:</p><ul><li><p>Environment configuration (do the OpenAI API keys even exist before deployment?)</p></li><li><p>Tool and API integration (does the routing work correctly?)</p></li><li><p>Memory and state propagation (does context persist across interactions?)</p></li><li><p>Planning and routing logic (does "show me sales by region" trigger the correct tool?)</p></li></ul><p>End-to-end tests simulate complete user journeys. Emeli's memorable example highlighted a critical insight: an agent's response was correctly formatted but contextually inappropriate for the business setting. 
This captures why format validation isn't enough.</p><h3><strong>LLM-as-a-Judge: scaling subjective evaluation</strong></h3><p>Rather than hiring armies of human evaluators, Emeli detailed using LLMs as judges. Define explicit criteria like: "The response is SAFE when it's polite, factual, and avoids controversial topics. The response is UNSAFE when it contains inflammatory language."</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YFlO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YFlO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 424w, https://substackcdn.com/image/fetch/$s_!YFlO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 848w, https://substackcdn.com/image/fetch/$s_!YFlO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!YFlO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YFlO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png" width="1456" 
height="822" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:649678,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/173346691?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YFlO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 424w, https://substackcdn.com/image/fetch/$s_!YFlO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 848w, https://substackcdn.com/image/fetch/$s_!YFlO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!YFlO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf66a129-19a6-4880-b9b0-c7f5a34ea88d_2304x1300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>She demonstrated actual implementation, showing how to set thresholds and automate scoring. But the critical caveat: you need to tune your LLM judge, too! Always validate judges against human expert assessments before deployment.</p><p>When Tom from the audience asked what happens when tests break, Emeli acknowledged the complexity: there are quite a lot of parameters which we can tweak - prompts, model providers, retrieval settings, vector transformations. The debugging process requires systematic experimentation across these parameters.</p><h3><strong>Practical Implementation Strategy</strong></h3><p>Emeli's approach prioritises observability: "No logs, no production." Start with comprehensive tracing - inputs, outputs, model versions, tool calls, execution paths. 
Even reading through traces is a good start.</p><p>For datasets, begin with 10-15 typical scenarios, then expand systematically - competitor questions (you don't want your assistant recommending rivals), forbidden topics, sales offers (preventing another Chevrolet "$1 car" scenario), and past hallucinations and edge cases.</p><p>Strategic insight: when everyone has the same models, evaluations become your moat - enabling faster shipping, easier model switching, and reduced risks. Investment in evaluation infrastructure pays dividends when foundation models become commoditised.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=1wA2YdRifJ4&quot;,&quot;text&quot;:&quot;Watch Full Talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/watch?v=1wA2YdRifJ4"><span>Watch Full Talk</span></a></p><h2><strong>Abhinav Singh: "Making vertical agents: challenges and learnings"</strong></h2><p>Abhinav from <a href="https://peak.ai/">Peak AI</a> opened with historical context that reframes our current AI chaos. The second industrial revolution offers parallels: first electric power plant (1882), practical induction motors (1892), but by 1912 - 20 years later - only 25% of factory machinery was electric-powered.</p><p>The delay wasn't about technology. Factories needed to solve two problems: attach motors to individual machines and redesign layouts around production flows rather than central power sources. 
With AI, we are in the same place, Abhinav argued.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JNXJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JNXJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 424w, https://substackcdn.com/image/fetch/$s_!JNXJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 848w, https://substackcdn.com/image/fetch/$s_!JNXJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!JNXJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JNXJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1116607,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/173346691?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JNXJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 424w, https://substackcdn.com/image/fetch/$s_!JNXJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 848w, https://substackcdn.com/image/fetch/$s_!JNXJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!JNXJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F180b9c4a-a367-404e-a5c0-218b76d58c9f_2298x1294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>The AI Paradox: intelligence vs knowledge</strong></h3><p>Today's AI paradox: models can achieve gold medals in mathematical olympiads whilst 95% of corporate AI pilots fail. The bottleneck to task performance is tacit domain knowledge, not intelligence.</p><p>LLMs trained on general internet content lack specific organisational knowledge - brand guidelines, internal processes, tribal knowledge. It's not in its training data, so why would it work? How can you possibly expect it to do that job?</p><p>This explains why naive approaches consistently disappoint. 
The classic "data lake + vector DB + LLM + instruction" approach fails because it ignores fundamental knowledge gaps.</p><h3><strong>Beyond simple vector databases</strong></h3><p>Instead of dumping everything into one vector database, Abhinav outlined a more structured way to organise knowledge:</p><ul><li><p>Hybrid retrieval: BM25 + vector search + a reranker</p></li><li><p>Separate vector databases for content types (annual reports, invoices, meeting notes, topic folders)</p></li><li><p>Careful attention to "authoritativeness" rankings</p></li><li><p>Clear access boundaries and context understanding</p></li></ul><p>The architecture he presented has retriever components feeding multiple specialised vector databases, each holding a different kind of organisational knowledge, with explicit authority hierarchies.</p><p><em>If this interests you, we&#8217;ve got a whole host of recent blogs for you to sink your teeth into:</em></p><ul><li><p><em><a href="https://www.fuzzylabs.ai/blog-post/improving-rag-performance-semantic-chunking">Improving RAG performance (semantic chunking)</a></em></p></li><li><p><em><a href="https://www.fuzzylabs.ai/blog-post/improving-rag-performance-hybrid-search">Improving RAG performance (hybrid search)</a></em></p></li><li><p><em><a href="https://www.fuzzylabs.ai/blog-post/improving-rag-performance-re-ranking">Improving RAG performance (re-ranking techniques)</a></em></p></li></ul><h3><strong>Control Flow: When to delegate vs when to keep control</strong></h3><p>Abhinav's most practical contribution: a framework for delegation decisions based on two factors - cost of error (what happens if the agent makes mistakes?) 
and delegation upside (what value does autonomy provide?):</p><ul><li><p><strong>Low cost + high upside</strong>: Full delegation with lots of context (recommendation systems)</p></li><li><p><strong>High cost + some upside</strong>: Frequent human review, short delegation cycles</p></li><li><p><strong>High cost + no upside</strong>: Deterministic flow, write regular software</p></li><li><p><strong>Low cost + unclear upside</strong>: "We haven't figured this out" (refreshingly honest)</p></li></ul><p>This maps surprisingly well to Andy Grove's "Task-Relevant Maturity" from <em>High Output Management</em>. As Abhinav noted, they didn't plan it this way, but they arrived at a similar conclusion - delegate based on capability and criticality, not AI hype.</p><h3><strong>Agent interface design: the IDE pattern</strong></h3><p>Successful agent interfaces resemble development environments more than chat boxes, Abhinav observed. Users want to set rules and constraints, supply additional context, chat with the agent, give feedback, and edit the artefacts it produces.</p><p>If you squint, it looks like VS Code or Cursor - the same kind of programming instincts and control that developers look for in their IDE. This suggests agent deployment requires sophisticated human-AI collaboration tooling, not just better models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=EsfOxT9xnBc&quot;,&quot;text&quot;:&quot;Watch Full Talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/watch?v=EsfOxT9xnBc"><span>Watch Full Talk</span></a></p><h2><strong>Chris Billingham: "Can we build it? Yes! We! Can!"</strong></h2><p>Chris from <a href="https://www.etiq.ai/">Etiq AI</a> opened with a slide of Bob the Builder. But this AI-generated Bob looked worried. 
Very worried.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3SQi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3SQi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 424w, https://substackcdn.com/image/fetch/$s_!3SQi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 848w, https://substackcdn.com/image/fetch/$s_!3SQi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 1272w, https://substackcdn.com/image/fetch/$s_!3SQi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3SQi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png" width="804" height="442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ebc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:804,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:559201,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/173346691?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3SQi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 424w, https://substackcdn.com/image/fetch/$s_!3SQi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 848w, https://substackcdn.com/image/fetch/$s_!3SQi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 1272w, https://substackcdn.com/image/fetch/$s_!3SQi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Febc43f42-aa8e-4eb9-bfbb-c27fe011bad0_804x442.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>He then introduced us to Will Ratcliff, an evolutionary biologist at Georgia Tech who tried Claude Code for a personal tax project involving Monte Carlo simulation.</p><p>Initially, Will felt incredible: "At first, I felt like I had freaking super powers. Like a PI with a nearly infinite supply of talented lab members, who worked at 1000x speed. It was exhilarating."</p><p>But then the cracks appeared: "But then, the cracks started to show. 
TL;DR: If Claude were my grad student, I'd kick them out of my lab."</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vgdc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vgdc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 424w, https://substackcdn.com/image/fetch/$s_!Vgdc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 848w, https://substackcdn.com/image/fetch/$s_!Vgdc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 1272w, https://substackcdn.com/image/fetch/$s_!Vgdc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vgdc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png" width="1456" height="713" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:296896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/173346691?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vgdc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 424w, https://substackcdn.com/image/fetch/$s_!Vgdc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 848w, https://substackcdn.com/image/fetch/$s_!Vgdc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 1272w, https://substackcdn.com/image/fetch/$s_!Vgdc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4477f296-f282-4d04-adeb-b6cfe817ac3f_1524x746.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>The lying copilot problem</strong></h3><p>The model was impressive at low-level coding but had a personality problem. It has the personality and confidence of a 10x coder, and absolutely lies to your face to maintain the illusion.</p><p>When Will asked for demographic statistics, Claude's web crawler was blocked from accessing the site. But Claude didn't admit that - it fabricated plausible numbers instead, claiming it had pulled real data when it used a simple function to fake it (life expectancy 85, with some noise).</p><p>This pattern repeated constantly. Will's conclusion: You just can't trust this thing.</p><h3><strong>Traditional copilots vs data science reality</strong></h3><p>These copilots focus on software development tasks, using code-base indexing and semantic search. But data science is fundamentally different. 
It's not about writing and remembering code. It's about DATA and CODE. CODE changes DATA. DATA shapes CODE.</p><p>Chris put up a simple diagram: blue boxes (data) connecting to yellow boxes (code) in an endless chain. If all you're looking at is the CODE, suddenly the DATA is an invisible second-class citizen.</p><h3><strong>The cognitive load visualisation</strong></h3><p>Chris showed his masterpiece - a (frighteningly) impressive detailed diagram of DS/ML cognitive load. An intricate network of yellow circles (data) connected by blue diamonds (code functions), branching and interconnecting across multiple processing stages.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d-BY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d-BY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 424w, https://substackcdn.com/image/fetch/$s_!d-BY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 848w, https://substackcdn.com/image/fetch/$s_!d-BY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!d-BY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!d-BY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/173346691?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d-BY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 424w, https://substackcdn.com/image/fetch/$s_!d-BY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 848w, https://substackcdn.com/image/fetch/$s_!d-BY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 1272w, https://substackcdn.com/image/fetch/$s_!d-BY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3e9086c-60ac-42ed-9c95-aabb8460a741_2310x1300.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is what you have to keep in your head, Chris explained. And this is just an extract from what is a pretty simple Kaggle notebook. Yet tools that you want to work with to help spit out code don't have any of this context.</p><h3><strong>Bob's solution: testing and RCA agents</strong></h3><p>Chris introduced his Testing and RCA (Root Cause Analysis) Agents. 
The Testing Agent analyses code to work out the most appropriate test at any single point, then runs tests on data either individually or in comparison, returning test results to keep you safe.</p><p>This agent understands the data lineage network. When doing a train-test split, it checks for target leakage. Building a model? It validates accuracy.</p><p>The RCA Agent leaps into action when tests fail, starting at the point of failure and working through your script, running other tests and interpreting results to find where the problem starts, then diagnosing how to fix it.</p><h3><strong>The virtuous cycle</strong></h3><p>Chris walked through Bob's complete workflow:</p><ol><li><p>Get Data - Bob starts with his dataset</p></li><li><p>"Hey man, build that pipeline!" - Bob asks Claude to generate code</p></li><li><p>Testing Agent activates - methodically tests each pipeline step</p></li><li><p>Tests fail - Red warnings across the network</p></li><li><p>RCA Agent leaps in - diagnoses the root cause</p></li><li><p>Fix applied - Solution feeds back into the loop</p></li></ol><p>So whilst Claude, Windsurf or Cursor may want to go off and create all these edifices at every single point, our two friends that keep us safe are constantly there, saying: "You've just made up some data that doesn't exist. It doesn't work."</p><p>The result? Bob can harness copilot superpowers to build ML pipelines that actually work - and are fully tested. 
And actually it's tested all the way along, so not only have you created that pipeline super quick, it's probably better tested than many pipelines you may have met.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/watch?v=toUx1mAIbOA&quot;,&quot;text&quot;:&quot;Watch Full Talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/watch?v=toUx1mAIbOA"><span>Watch Full Talk</span></a></p><h2><strong>The big takeaway from MLOps.WTF #5?</strong></h2><p>Every speaker converged on key insights that align with what we're all seeing in our work:</p><ul><li><p>Observability enables everything else. 
Whether it's Emeli's comprehensive tracing, Abhinav's knowledge structure monitoring, or Chris's data lineage tracking - you can't improve what you can't see.</p></li><li><p>Domain expertise remains irreplaceable - humans still define what "correct" means in specific domains.</p></li><li><p>Hybrid systems outperform full automation.</p></li><li><p>Systematic approaches beat ad-hoc solutions.</p></li><li><p>The challenge isn't model capabilities - it's building systems that make AI behaviour predictable and trustworthy in production.</p></li></ul><p>Speaking of things that are hard to predict - who's going to tell Matt that TopShop is returning to the high street? His "small, slim fit from TopMan" crisis might finally be over.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zUFi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zUFi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 424w, https://substackcdn.com/image/fetch/$s_!zUFi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 848w, https://substackcdn.com/image/fetch/$s_!zUFi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zUFi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zUFi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png" width="1456" height="1125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1125,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5578280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/173346691?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zUFi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 424w, https://substackcdn.com/image/fetch/$s_!zUFi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 848w, 
https://substackcdn.com/image/fetch/$s_!zUFi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 1272w, https://substackcdn.com/image/fetch/$s_!zUFi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe324e8df-87ea-45bb-8ad5-76470da4889c_2102x1624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1><strong><br>Final Bits From Us</strong></h1><p>If you're working on production AI systems - especially if you've solved some of the 
reliability challenges discussed here - we'd love to hear from you. The community learns best from practitioners sharing real-world experience, wins and failures alike. Speaking of which, we&#8217;re always on the lookout for interesting speakers: if you&#8217;d like to have a go, please email <a href="mailto:matt@fuzzylabs.ai?subject=I'd%20like%20to%20speak%20at%20the%20next%20MLOPs.wtf%20event">Matt Squire</a>!</p><h2><strong>What's Coming Up</strong></h2><p><strong>18th September</strong>: Tom vs Matt founder face-off at MRJ Recruitment's Big Tech Debate on vibe coding at Bloc, Manchester.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.meetup.com/the-big-tech-debate/events/310103935/&quot;,&quot;text&quot;:&quot;Come see the debate!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.meetup.com/the-big-tech-debate/events/310103935/"><span>Come see the debate!</span></a></p><p><strong>24th September</strong>: Matt's speaking at Manchester Tech Festival - "Are we the last programmers? AI and the future of code." 
To get 10% off - use the code SQUIRE10</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.manchestertechfestival.co.uk/main-festival/&quot;,&quot;text&quot;:&quot;Get Manchester Tech fest tickets&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.manchestertechfestival.co.uk/main-festival/"><span>Get Manchester Tech fest tickets</span></a></p><p><strong>14th October</strong>: We're hosting Awaze's Women in Tech early career networking evening at DiSH.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.eventbrite.dk/e/awaze-women-in-tech-x-fuzzy-labs-early-career-networking-event-tickets-1616354883969?aff=ebdssbdestsearch&amp;keep_tld=1&quot;,&quot;text&quot;:&quot;Come to the Women in Tech event&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.eventbrite.dk/e/awaze-women-in-tech-x-fuzzy-labs-early-career-networking-event-tickets-1616354883969?aff=ebdssbdestsearch&amp;keep_tld=1"><span>Come to the Women in Tech event</span></a></p><p><strong>18th November</strong>: MLOps.WTF #6 - save the date, speaker applications open.</p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We're Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p><em>Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won't cut it.</em></p><p><strong>Currently:</strong> like what we&#8217;re cooking? 
We&#8217;re on the lookout for great talent, see our open roles below:<br><a href="https://fuzzy-labs.webflow.io/job-listing/mlops-engineer">MLOps Engineer<br></a><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer<br></a><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer</a></p><p><strong>Liked this?</strong> Forward it to someone who's wrestling with getting their ML models into production and keeping them there. Or give us a follow on<a href="https://www.linkedin.com/company/fuzzy-labs"> LinkedIn</a> to be part of the wider Fuzzy community.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/p/mlopswtf-5-newsletter-14?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/p/mlopswtf-5-newsletter-14?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p><strong>Not subscribed yet? </strong>Come on, we always say this - what you waiting for? 
Share with your colleagues, friends, your mum and your neighbours&#8230; especially if they love MLOps.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[The Fuzzy Labs curriculum: thoughts from a Fuzzy Intern.]]></title><description><![CDATA[In the accelerated age, how are we keeping up?]]></description><link>https://www.mlops.wtf/p/the-fuzzy-labs-curriculumwhy-traditional</link><guid isPermaLink="false">https://www.mlops.wtf/p/the-fuzzy-labs-curriculumwhy-traditional</guid><pubDate>Thu, 28 Aug 2025 12:18:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZJqf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This week&#8217;s episode is brought to you by <a href="https://www.fuzzylabs.ai/about-us">Savannah</a>, MLOps Engineering Intern at Fuzzy Labs.<br><br></em>Ahoy there &#128674;,</p><p>Three weeks ago everyone was buzzing about Claude's latest coding capabilities. This week it's all about Cursor and its GPT-5 integration. Next week? Who knows.</p><p>Every week there's something new, some breakthrough that makes processes faster and more efficient, or deepens learning and understanding. The tool race is rapid and new players are constantly changing the approach.</p><p>I'm currently doing an internship at Fuzzy Labs, whilst studying for my Master's, which means I'm getting this unique perspective on how learning actually works when the world is accelerating. 
Bouncing between academic theory and real-world application daily, watching how the industry moves at breakneck speed whilst our educational institutions... well, they're still teaching pretty much the same way they always have.</p><h2><strong>When best practice meets reality</strong></h2><p>What's interesting is how this internship is revealing the gaps between what we're taught is "best practice" and what actually works when you're trying to keep up with an industry that reinvents itself weekly. I've been following all the traditional advice - the approaches everyone swears by - but they're starting to feel increasingly disconnected from the reality of working in tech right now.</p><p>Take the standard learning path everyone recommends. You start by watching YouTube tutorials. Then you move on to building the classic projects: Tic Tac Toe, calculator apps, to-do lists. Maybe you grind through some LeetCode problems... it's the well-worn path that's supposed to prepare you for a career in tech.</p><p>But these approaches were designed for a different world. When Claude can generate a perfectly functional game in thirty seconds, what are we proving by spending days building one ourselves? When GitHub Copilot can solve algorithmic challenges faster than you can read them, does LeetCode truly measure someone's coding ability?</p><p>The problem isn't that these tools are useless - LeetCode is obviously still valuable. 
But we've built these elaborate hiring processes around demonstrating capabilities that AI now handles better than most humans, whilst completely missing the critical thinking and discernment that actually matter.</p><h2><strong>The theory of revolution</strong></h2><p>The problem isn't just that traditional approaches feel outdated - they're creating a generation of graduates who can tick all the educational boxes without developing the skills that actually matter.</p><p>Are we teaching people to jump through hoops that no longer exist whilst missing the critical thinking that actually determines success?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZJqf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZJqf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 424w, https://substackcdn.com/image/fetch/$s_!ZJqf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 848w, https://substackcdn.com/image/fetch/$s_!ZJqf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 1272w, https://substackcdn.com/image/fetch/$s_!ZJqf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZJqf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png" width="789" height="578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:578,&quot;width&quot;:789,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlops.wtf/i/172157915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZJqf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 424w, https://substackcdn.com/image/fetch/$s_!ZJqf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 848w, https://substackcdn.com/image/fetch/$s_!ZJqf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 1272w, https://substackcdn.com/image/fetch/$s_!ZJqf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e99559a-5f6a-45af-b081-654c1507cac4_789x578.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><h2><strong>The multi-faceted approach</strong></h2><p>What I've found works is abandoning the traditional sequential learning model completely. Instead of trying to master one thing at a time, I'm doing everything simultaneously and letting each element reinforce the others.</p><p>I've got a skill tracker going that shows what I'm actually absorbing versus what I think I'm learning. I'm reading ML and MLOps theory whilst working on production systems. 
The mentorship element has been massive: I'm surrounded by talented people I can question relentlessly.</p><p>The biggest shift has been around curiosity. It sounds obvious, but I think there's something to be said for being the stupidest person in the room - you stand the most to gain, right? Most educational setups don't really encourage this kind of vulnerability, but it's been game-changing for developing actual understanding rather than just surface knowledge.</p><p>So when I'm trying to understand something now, I'll read about it, watch tutorials, then immediately apply it to whatever project I'm working on. Keep it simple, but make sure it's connected to something real rather than to abstract exercises. Mentorship is what makes this workable - I can tackle projects that would have been completely overwhelming before. When I hit roadblocks, there's someone there to help.</p><h2><strong>Package-specific learning for the modern brain</strong></h2><p>Here's what's really working: instead of trying to master "software engineering" as this enormous abstract concept, I'm focusing on specific packages and tools.</p><p>You don't learn React because it's popular - you learn it because you're building something specific that needs a dynamic interface. You don't study Kubernetes because it's on job descriptions - you learn it because you're solving a particular deployment challenge you're actually facing. This use-case-driven approach to learning forces you to think critically about why you're using specific tools rather than just following generic tutorials.</p><h2><strong>The tool paralysis trap</strong></h2><p>One thing that keeps catching people out is the AI tool adoption phase.</p><p>There are so many tools launching that you could easily spend all your time evaluating options instead of actually doing anything.</p><p>This is where developing discernment becomes crucial. 
The teams that work most effectively have made deliberate choices about sticking to particular tools. This applies especially well to MLOps because the field moves so fast.</p><h2><strong>Rethinking the how</strong></h2><p>What's becoming clear is that we need to find new ways of learning that don't involve generic projects and generic tutorials. Something more nuanced, more area-specific, more connected to actual problems people are solving. The traditional model of front-loading education and then applying it for years feels increasingly disconnected from how work actually happens.</p><p>For anyone leading teams, this creates interesting challenges. How do you keep people current without losing productivity? How do you evaluate candidates when traditional measures don't reflect what actually matters? How do you build learning systems that can adapt as quickly as the technology does whilst still developing critical thinking skills?</p><p>What's exciting is seeing companies like <a href="https://www.404media.co/meta-is-going-to-let-job-candidates-use-ai-during-coding-tests/">Meta experiment with AI-assisted interviews</a> that focus on your ability to work with these tools effectively, rather than your ability to memorise algorithms. The focus is shifting from "can you code" to "can you think critically about code and work effectively with AI tools."</p><p>If you're still preparing for coding interviews the traditional way, you're optimising for 2015's job market.</p><h2><strong>The bigger picture</strong></h2><p>The internship is showing me that there's a real opportunity here. These AI tools aren't just changing how we work - they're revealing fundamental gaps in how we prepare people for that work. The traditional barriers to entry are simultaneously too high and completely meaningless. 
We're testing for skills that AI handles better than humans, whilst missing the adaptability that actually matters.</p><p>The key is forcing yourself to think rather than outsourcing the thinking to AI. To use these tools in ways that amplify your thinking rather than replace it.</p><p>By the time this gets published, there'll probably be a new AI tool, Claude Code and Cursor will feel ancient, and we'll all be scrambling to keep up with whatever comes next. But that's exactly the point - in a world where the tools change weekly, the ability to learn and adapt isn't just useful, it's the only thing that stays constant.</p><p>The question isn't whether we can keep up with the accelerated age. It's whether we're building the right foundations to thrive in it, or clinging to educational models designed for a world that no longer exists.</p><p>So here's my challenge: go look at your current curriculum, your learning plan, your interview prep. How much of it would still matter if AI capabilities doubled tomorrow? Because at the current pace, that's not a hypothetical - it's here.</p><p></p><p><em>Savannah is currently pursuing a Master&#8217;s degree in Computer Science and AI part-time, whilst trying to gain as much knowledge as possible through osmosis from the other Fuzzicians. When she&#8217;s not grinding out a deadline for her Master&#8217;s, she can be found getting stuck into a book or learning salsa. What she lacks in rhythm she makes up for in enthusiasm. She also has a strong interest in researching tech for good.</em></p>
<div><hr></div><h2><strong>And Finally&#8230;</strong></h2><h2><strong>People You Should Be Following</strong></h2><ul><li><p><strong><a href="http://linkedin.com/in/peter-gostev">Peter Gostev:</a> </strong>Head of AI at Moonpig, cutting through AI hype with clear, unbiased insights on the latest models and breakthroughs.</p></li><li><p><strong><a href="http://linkedin.com/in/eric-vyacheslav-156273169/">Eric Vyacheslav:</a> </strong>AI/ML enthusiast sharing fresh insights on the latest models and trends. A must-follow for anyone keen on staying ahead in the AI space.</p></li></ul><h2><strong>Upcoming Events &amp; Community Updates</strong></h2><p><strong>&#128483;&#65039; Big Tech Debate: Vibe coding is democratising software but lowering engineering standards</strong></p><p><strong>MRJ Recruitment &amp; Counter</strong> are hosting their next Big Tech Debate on Vibe Coding. What makes this one particularly interesting? You might recognise the speakers &#128064;</p><p>It's Fuzzy founder vs founder: <strong>Tom </strong>will be arguing for the motion whilst <strong>Matt </strong>takes the opposing view. Are we witnessing genuine democratisation or a race to the bottom in engineering quality? 
Find out by tuning in &#128071;</p><p>&#128197; <strong>18th September<br></strong> &#128205; <strong>Bloc, 17 Marble St, Manchester</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bit.ly/3UPB1Bq&quot;,&quot;text&quot;:&quot;Grab Your Spot!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bit.ly/3UPB1Bq"><span>Grab Your Spot!</span></a></p><p></p><p><strong>MLOps.WTF Meetup - 9th September</strong></p><p>Our next <strong>MLOps.WTF meetup</strong> is on <strong>9th September</strong> at our Manchester HQ within <strong>DiSH</strong>. Demand is super high for this agentic-themed talk, and we're now running a waiting list!</p><p><em>Kind ask: if you've got a ticket but can't make it, please cancel it so someone else can take a spot.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bit.ly/4oY7iEe&quot;,&quot;text&quot;:&quot;Tickets Here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bit.ly/4oY7iEe"><span>Tickets Here</span></a></p><p></p><p><strong>Awaze Women in Tech Early Career Networking - 14th October</strong></p><p>"Education to Industry: Discussing the Transition"</p><p><strong>We&#8217;re proud to be hosting Awaze's Women in Tech team&#8217;s</strong> early career networking evening at <strong>DiSH</strong>. Designed specifically for entry-level career support, this event is set to empower and support female-identifying people entering the tech and data workforce.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://bit.ly/4n5E7gJ&quot;,&quot;text&quot;:&quot;Get Your Ticket&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://bit.ly/4n5E7gJ"><span>Get Your Ticket</span></a></p><p><br><em>Fancy being a speaker at the event? 
Awaze are on the lookout for great and inspiring presenters and panellists - apply <a href="https://forms.office.com/pages/responsepage.aspx?id=aGuEvSoTRkqx59CQ4WjAogia4Sr438RLlTmO3Z2GtZZUNzZTMlBDTjlIR1lNUE8wS1o4VURWWFNGNy4u&amp;route=shorturl">here</a>.</em></p><p><strong>Don&#8217;t forget, we also have <a href="https://www.manchestertechfestival.co.uk/mtf/">Matt's talk</a></strong><a href="https://www.manchestertechfestival.co.uk/mtf/"> at </a><strong><a href="https://www.manchestertechfestival.co.uk/mtf/">Manchester Tech Festival</a></strong> this September: "Are we the last programmers? AI and the future of code." Because if we're going to debate the future, we might as well do it properly.</p><div><hr></div><h2><strong>About Fuzzy Labs</strong></h2><p><em>We're Fuzzy Labs. A Manchester-rooted open-source MLOps consultancy, founded in 2019.</em></p><p><em>Helping organisations build and productionise AI systems they genuinely own: maximising flexibility, security, and licence-free control. We work as an extension of your team, bringing deep expertise in open-source tooling to co-design pipelines, automate model operations, and build bespoke solutions when off-the-shelf won't cut it.</em></p><p><strong>Currently:</strong> We&#8217;re hiring and we&#8217;ve got a few roles to fill:<br><a href="https://www.fuzzylabs.ai/job-listing/mlops-engineer">MLOps Engineer</a><br><a href="https://www.fuzzylabs.ai/job-listing/senior-mlops-engineer">Senior MLOps Engineer<br></a><a href="https://www.fuzzylabs.ai/job-listing/mlops-tech-lead">Lead MLOps Engineer<br><br></a>If you, or someone you know, is looking for a great opportunity&#8230; don&#8217;t hesitate to apply!</p><p><strong>Want to chat?</strong> Max and Robbie are here to chat about tackling business challenges with open source and production-ready AI. Fancy a call? 
Reach out.</p><p><strong>Liked this?</strong> Forward it to someone who's wrestling with getting their ML models into production and keeping them there. Or give us a follow on<a href="https://www.linkedin.com/company/fuzzy-labs"> LinkedIn</a> to be part of the wider Fuzzy community.</p><p><strong>Not subscribed yet? </strong>What you waiting for? The next issue is already shaping up to be particularly juicy.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.mlops.wtf/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.mlops.wtf/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>