The quagmire, the creek, and the wild west: AI agents in finance
MLOps.WTF Edition #28
Our MLOps.WTF meetup #8, where record turnout met record-wet weather, Schitt’s Creek fans were sated, and BlockRocket took off.
In an unexpected twist for spring, the miserable March evening (complete with sleet) had no impact on the steady MLOps lovers of Manchester. Our 8th MLOps.WTF was absolutely rammed!
The topic was AI agents in finance, more specifically: how do you build agentic systems in production, in a regulated sector, when governance, risk, and compliance are asking entirely reasonable questions and it hasn’t been done before? “It’s safe to say there’s no easy answers here. There’s no industry standard way to do this right now. We are all learning together as we go along.”
Matt kicked us off, explaining that our CEO Tom has basically automated his entire job this week, and asking: if agents can already raise pull requests and send emails on your behalf, why can’t they also make payments?
Three talks, one great gilet reveal and a particularly strong and ethically ambiguous case for giving your AI agent access to your bank account. 👇





Christopher Brook, Lloyds Banking Group: “Enabling Agentic Operations at Scale”
Christopher is principal engineer for Hive Lab, an internal platform Lloyds are building to run agentic operations across the group.
Building one AI agent is pretty straightforward. The problem is doing it across an organisation of 65,000 people without ending up, as Christopher put it, “in a quagmire of solutions”. Hive Lab’s answer can be distilled into five useful pillars, or a simplified colosseum, depending on which way you look at it.
Pillar 1: Registration. Getting agents visible
How does a developer know what agents already exist before building their own? And more importantly here: how does an agent know? Without a shared catalogue, you end up rebuilding things that already exist, and with agents that can’t find each other. Lloyds Banking Group’s solution: any agent or tool that goes live automatically registers itself in a shared registry, queryable by both humans and agents.
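Here’s a rough sketch of the shape of that idea (our own illustration, not Lloyds’ actual implementation): every agent or tool announces itself on deployment, and both humans and agents can query the registry by capability.

```python
# Minimal self-registration sketch: each agent or tool registers itself on go-live,
# and the registry is queryable by capability. All names and URLs are illustrative.
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    name: str
    description: str
    capabilities: list[str]
    endpoint: str


@dataclass
class AgentRegistry:
    _records: dict[str, AgentRecord] = field(default_factory=dict)

    def register(self, record: AgentRecord) -> None:
        # Called automatically as part of the agent's deployment pipeline.
        self._records[record.name] = record

    def find(self, capability: str) -> list[AgentRecord]:
        # Answers both "does this already exist?" (humans) and runtime discovery (agents).
        return [r for r in self._records.values() if capability in r.capabilities]


registry = AgentRegistry()
registry.register(AgentRecord(
    name="card-replacement-agent",
    description="Handles lost and stolen card journeys",
    capabilities=["cards.replace", "cards.status"],
    endpoint="https://agents.internal/card-replacement",
))
print([r.name for r in registry.find("cards.replace")])
```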
Pillar 2: Orchestration. How agents find each other at scale
As the system grows, agents need to know what other agents exist. You can hardwire it (agent A always calls B, C and D) but then every time you add a new agent, you’ll need to update that list… forever. Or you could let agents discover each other and collaborate dynamically at runtime, which can scale but becomes much harder to control. At Lloyds’ size, hardwiring isn’t viable long-term, so dynamic discovery is where they’re headed. “There is no right answer to which pattern is the right thing to follow.” Yet.
Pillar 3: Tooling. Abstract your legacy systems away from your agents
Lloyds have more than 500 applications, each holding a different slice of the same customer. If you build agents that connect directly to those systems, you’ll be writing maintenance tickets every time an API changes. Their answer: stop thinking about systems entirely. Organise by what the data is (customer information, events, products) and build tools that sit above whichever system holds it underneath. That way the APIs can change but the tool stays the same.
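Roughly what that looks like in code, as our own illustration with made-up names: the agent is wired to a tool defined by what the data is, and the system behind it can be swapped without the agent ever noticing.

```python
# "Organise by data, not by system": the agent only ever sees customer_information_tool;
# which backend serves it is an implementation detail. All names here are made up.
from typing import Protocol


class CustomerInfoSource(Protocol):
    def fetch(self, customer_id: str) -> dict: ...


class LegacyCrmSource:
    def fetch(self, customer_id: str) -> dict:
        # Imagine a SOAP call into a twenty-year-old CRM here.
        return {"customer_id": customer_id, "source": "legacy_crm"}


class NewPlatformSource:
    def fetch(self, customer_id: str) -> dict:
        # ...and a REST call into its replacement here.
        return {"customer_id": customer_id, "source": "new_platform"}


def customer_information_tool(customer_id: str, source: CustomerInfoSource) -> dict:
    """The only thing the agent is wired to; swapping the source is invisible to it."""
    return source.fetch(customer_id)


print(customer_information_tool("12345", LegacyCrmSource()))
print(customer_information_tool("12345", NewPlatformSource()))
```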
Pillar 4: Memory. Two different problems
The first is knowledge quality. In any organisation this size, internal documentation accumulates contradictions and repetition over time. You can’t just point an agent at raw text and expect sense back. So every document goes through a pipeline first: cleaned, structured into summaries and embeddings the agent can actually use. An agent is only as good as the knowledge you give it. This is where you make sure that knowledge is actually good.
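In sketch form, with the summarise and embed steps stubbed out as placeholders (in a real pipeline those are LLM and embedding-model calls, and the details depend on your stack):

```python
# Knowledge-quality pipeline sketch: clean every document before the agent sees it,
# then attach a summary and an embedding. summarise() and embed() are placeholders.
import re


def clean(raw: str) -> str:
    # Strip boilerplate and collapse whitespace; deduplication would live here too.
    return re.sub(r"\s+", " ", raw).strip()


def summarise(text: str) -> str:
    # Placeholder for an LLM summarisation call.
    return text[:200]


def embed(text: str) -> list[float]:
    # Placeholder for an embedding-model call.
    return [float(len(text))]


def ingest(raw_document: str) -> dict:
    cleaned = clean(raw_document)
    return {"text": cleaned, "summary": summarise(cleaned), "embedding": embed(cleaned)}


print(ingest("  Replacement cards   arrive in three to five working days.  "))
```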
The second: agents should remember past conversations. What an agent said to a customer last week, what was agreed, what context carries over. Session summaries need to follow the agent into the next call. That one’s less solved, but they’re making progress. Looks like agents will be by our side for every step of the journey.
Pillar 5: Evals. Same process, new names.
The tools have new names (RAGAS, DeepEval, Pegasus, the usual suspects) and version hashing means you can trace a bad answer back to exactly which release introduced it. But the principle is the same one we all already know: don’t skip your quality checks because the stack looks different.
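The version-hashing trick is worth a quick sketch, because it’s cheap and pays off the first time something goes wrong: fingerprint everything that could change an answer, and stamp the hash onto every response and eval result. The config fields below are illustrative.

```python
# Fingerprint the release: hash the prompt, model, and retrieval config together,
# then attach that hash to every answer and eval result so a bad answer can be
# traced back to the release that introduced it. Field values are illustrative.
import hashlib
import json

release_config = {
    "prompt_template": "You are a helpful banking assistant...",
    "model": "some-llm-v2",
    "retriever": {"top_k": 5, "index": "customer-kb-2025-03"},
}

version_hash = hashlib.sha256(
    json.dumps(release_config, sort_keys=True).encode()
).hexdigest()[:12]

answer_record = {
    "question": "When will my replacement card arrive?",
    "answer": "Three to five working days.",
    "version": version_hash,  # the same hash shows up in eval results and logs
}
print(answer_record)
```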
By going back to the basics of good engineering, you’ll be able to ship agents you can be confident in.
The two things that run through all of it
Security and authentication came up across every pillar. At this scale you can’t just let agents call each other freely — every agent needs to know what it’s permitted to do, and be able to prove it. The other thread is cost. Sixty-five thousand people running agents that all make LLM calls adds up fast, and observability is how you stay on top of it before it becomes a nasty surprise. Design both in from day one.
Dmitry Leyko, thinkmoney: “Agentics of Order”
Dmitry Leyko is Head of AI and ML at thinkmoney, an e-money fintech based in Media City.
As he took to the stage, he paused: “I realised I forgot something. Talking about fintech. I have to have a gilet, right?” Eight meetups in. The speakers know their audience.
thinkmoney are building a Financial Smart Assistant. They have an EMI licence, which means they don’t just discuss your money, they hold it, move it, manage it. That changes what the AI needs to be able to do. It can’t just be broadly helpful. It needs to be accurate, auditable, and trustworthy enough to act on someone’s behalf with their actual money. “That is what makes this a financial agent, not a chatbot.”
“How do we bring absolute trust to the agentic system?”
Well, the real question isn’t how you bring absolute trust, but how you build enough trust (enough for governance, for risk, for compliance) to actually ship.
The answer: get GRC (Governance, Risk, and Compliance) in the room from day one. If you’ve told them you’ve got to “fold in the cheese”, they need to not only know what that means but to have been with you in the kitchen from the beginning.
DeepEval: find out your agent is lying in continuous integration, not in a customer conversation
Dmitry demoed DeepEval, connected to Llama. The demo caught an agent telling a customer their replacement card would arrive in seven to ten working days when the knowledge base said three to five. That’s what CI is for: flagging that your agent is fibbing before your customer does.
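Here’s roughly what that gate looks like as a DeepEval test. The answer, knowledge-base snippet, and threshold are our own illustration, and the metric needs an LLM judge configured (OpenAI by default, or your own model):

```python
# test_card_replacement.py: a sketch of a CI gate that checks the agent's answer
# against the knowledge base using DeepEval's FaithfulnessMetric.
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase


def test_replacement_card_timescale():
    test_case = LLMTestCase(
        input="When will my replacement card arrive?",
        # What the agent actually said (here, the fib the demo caught).
        actual_output="Your replacement card will arrive in seven to ten working days.",
        # What the knowledge base actually says.
        retrieval_context=["Replacement cards arrive within three to five working days."],
    )
    # Fails the build if the answer isn't grounded in the retrieved context.
    assert_test(test_case, [FaithfulnessMetric(threshold=0.8)])
```

Run it in the pipeline with `deepeval test run` (or plain pytest) and the seven-to-ten-days fib fails the build instead of reaching a customer.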
Post-deployment: your observability layer is your evidence for GRC
LangSmith runs live evaluations as customers chat, scoring every turn across quality, accuracy, and a multitude of other metrics.
But more importantly, every message carries state, which lets you ask: was this a vulnerable customer? What was the agent’s decisioning at that point?
This gives you the audit trail. Build it from the start.
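A rough sketch of what “every message carries state” can look like with LangSmith’s traceable decorator. The flag names and the stubbed agent are ours, and tracing assumes the usual LangSmith API key and tracing environment variables are set:

```python
# Each customer turn is traced with metadata you can filter, evaluate, and audit
# against later. The agent function is a stub; the metadata keys are illustrative.
from langsmith import traceable


@traceable(name="customer_turn", run_type="chain")
def answer_customer(message: str) -> str:
    # Placeholder for the real agent call.
    return "Your replacement card will arrive in three to five working days."


reply = answer_customer(
    "Where is my new card? I'm really struggling without it.",
    langsmith_extra={
        "metadata": {
            "vulnerable_customer": True,          # illustrative flag
            "decision_path": "card_replacement",  # what the agent decided to do
        },
        "tags": ["live-eval"],
    },
)
print(reply)
```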
The loop still has humans in it
Humans stay in the loop, but humans make mistakes too; it’s why it’s called human error. Someone might delete a node in the eval pipeline. Someone might edit something they shouldn’t. The CI gate catches it before it reaches test. Then you deploy, observe, evaluate, learn, and run the loop again.
But equally, there’s agent error too, and someone still needs to look at what the system is doing: continuously checking what the agent is up to, which for a regulated fintech holding real money is vital.
Andy Gray and James Morgan, BlockRocket: “No Human in the Loop”


Andy Gray and James Morgan co-founded BlockRocket and have been building on blockchain since 2017. They knew exactly what they were riding into. After Christopher’s five-pillar framework and Dmitry’s hidden state layer, they saddled up with: “Maybe we’re on the slightly more wild west side of that space versus the traditional banking.”
Their question: if your agent can already do things on your behalf, why can’t it pay for things?
A Twitter bot that struck gold and couldn’t get to the bank
About a year ago, someone gave an AI agent a Twitter account, a crypto wallet, and a starting pot of money. The agent traded, attracted followers, launched a token. The token hit $7 million. The creator tried to withdraw, thinking they’d struck gold… but they couldn’t. The agent had no identity, no bank account, no way to prove the money belonged to anyone at all... The human couldn’t touch it. “All of a sudden, this account has got agency - it’s got cash. What else can you do in the world?”
The takeaway, other than a highly amusing anecdote, is that the gap between an agent generating economic value and anyone actually accessing it is real, and it’s structural.
Why existing payment rails don’t work for agents
ACH transfers (secure, electronic bank to bank transfers) can take days. Card fees make micropayments uneconomical. Every API needs a human to sign up first. The infrastructure was built for people, and it stops dead the moment an agent tries to use it.
If your agent needs to pay for data, spin up compute, or receive payment for something it’s done, traditional finance has no clean answer.
x402: the HTTP status code that’s been waiting since 1995
In 1995, the original HTTP spec included status code 402, “Payment Required”, and (at the time) marked it “reserved for future use.” Thirty years later, Coinbase and Cloudflare have launched x402 to finally stake the claim.
The flow: agent sends a GET request. Server responds 402 with a price and wallet address. Agent retries with payment in the header. Server responds: 200 OK. Data arrives, payment settled on-chain in about 200ms at under a tenth of a cent. No accounts. No chargebacks. At all. Ever.
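Here’s a back-of-the-envelope version of that flow in plain requests. The header name and response shape follow the protocol’s public docs as we understand them, make_signed_payment stands in for what a proper x402 client library would do (construct and sign the on-chain payment), and the URL is invented.

```python
# x402 flow sketch: GET, get told the price via a 402, retry with payment attached.
import base64
import json

import requests

RESOURCE_URL = "https://api.example.com/premium-data"  # illustrative


def make_signed_payment(requirements: dict) -> str:
    # A real x402 client signs a payment authorisation for the quoted price and
    # wallet address; here we just echo the requirements back, base64-encoded.
    return base64.b64encode(json.dumps({"ack": requirements}).encode()).decode()


response = requests.get(RESOURCE_URL)
if response.status_code == 402:
    requirements = response.json()  # price, asset, and wallet address to pay
    payment = make_signed_payment(requirements)
    response = requests.get(RESOURCE_URL, headers={"X-PAYMENT": payment})

print(response.status_code)  # 200 once the payment verifies
```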
CoinGecko is gating data through it today. Stripe integrated it in February 2026. Google’s A2A protocol has it built in. There’s no shortage of companies riding the same trail, and it’s a safe bet this payment flow will have a big impact on how we think about online payments, if it hasn’t already.
People are settling this new frontier, but there’s definitely no sheriff yet. “It’s a very early protocol, but it’s very fun to build on.”
Ride first, sort the fence posts later. It’s how most of the internet got built.
The big takeaway from MLOps.WTF #8?
Get GRC in the room early. You want Governance, Risk, and Compliance on the journey with you. They should be aware of what’s being built, even if they don’t fully understand it. The earlier you can bring them in, the better.
Human QA is still in the loop. Better evals, golden datasets, and live observability are genuinely useful, but someone still needs to go over what the system is doing with a fine-tooth comb. We’re not yet confident enough for full automation.
Aim for enough trust, not absolute trust. It’s more realistic but the level of trust you need is still extremely high.
The agentic payment layer is coming. The compliance questions are still open and the full scale is hazy, but the infrastructure is there and the wheels are in motion.
Thank you to Christopher, Dmitry, Andy, and James. You set a very high bar!
Final bits
Hot off the press: our recipe book, Cooking with MLOps, is out. Tried and tested approaches to building delicious AI systems across a range of real situations.
Download it via the link, or email us and we’ll whip something up.
What’s coming up
Our next event is a panel on agent security — can agents ever be truly safe, secure, and trustworthy?
📅 20th May. Save the date.
If you want to speak about agent security, or know someone who should: get in touch.
About Fuzzy Labs
We’re Fuzzy Labs. Manchester-rooted open-source MLOps consultancy, founded in 2019. We help organisations build and productionise AI systems they genuinely own.
We’re hiring: MLOps Engineer, Senior MLOps Engineer, Lead MLOps Engineer and Public Sector Lead (Secure Government).
Liked this? Send it to someone trying to convince their compliance team that agentic AI is the way to go. Or give us a follow on LinkedIn.
Not subscribed yet? We know where you live. Just kidding…






