Ahoy there 🚢⚓,
Matt Squire here, CTO and co-founder of Fuzzy Labs, and this is the 10th edition of MLOps.WTF, a newsletter where I discuss topics in Machine Learning, AI, and MLOps.
How do we define MLOps? Is it just “DevOps for Machine Learning”, or is there more to it?
A couple of weeks ago we hosted another MLOps.WTF meetup here in Manchester, and I opened the event with this question.
“Lifecycle management of machine learning applications” was one suggestion. While that sounds an awful lot like DevOps, in reality training and running ML models comes with challenges that aren’t seen in the DevOps world. Take monitoring, for example: an “ordinary” application, once deployed, stays put until we release a new version. But a model’s correctness can change if the world around it changes.
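To make that concrete, here’s a minimal sketch of the kind of check a model-monitoring system might run: a two-sample Kolmogorov–Smirnov test comparing a feature’s live distribution against its training-time distribution. The function name and threshold are my own illustration, not any particular tool’s API:

```python
from scipy.stats import ks_2samp

def feature_has_drifted(training_values, live_values, alpha=0.05):
    """Flag drift when live data no longer resembles training data."""
    statistic, p_value = ks_2samp(training_values, live_values)
    # A small p-value means the two samples are unlikely to come from
    # the same distribution, so the model may need retraining
    return p_value < alpha
```

No ordinary application needs a test like this running after deployment, which is a good hint that MLOps is more than DevOps with a new name.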
For my answer, I go to the 2014 paper from Google Research, Machine Learning: The High-Interest Credit Card of Technical Debt. The authors highlight the tight coupling of data, code, model, and environment; the difficulty of tracking a model’s correctness over time; and the complexities involved in testing ML systems. All of these demand specialised tooling and expertise.
In production, we’re often faced with in-depth engineering challenges that impact scale, security, and safety. At this meetup we covered three topics: first, Will Faithful talked about how to work with big graph data; then Dr Edoardo Manino covered the frankly terrifying notion of ML models deployed into safety-critical applications; and finally Thom Kirwan-Evans explained how COVID taught him to think in pipelines before models.
You can watch the highlights below, and you can read on for the full details from each talk.
Graph algorithms at scale
A lot of data is graph-shaped.
Imagine you’ve just been hired as a data scientist for a bank. Unfortunately, there’s been a security incident, and your task is to understand precisely how many customers, accounts, and devices may be compromised. Customers can have multiple accounts, which they access from a variety of different devices.
If we think about our data as a graph, that is, a set of entities and the relationships between them, then we can make some deductions. For instance, if one customer is compromised, then anybody else who accesses an account from the same device is also compromised.
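Here’s a toy version of that deduction in Python using NetworkX (the entities and edges are made up for illustration). Once customers, accounts, and devices live in one graph, “everything potentially compromised” is just the connected component around the compromised customer:

```python
import networkx as nx

# Toy graph: customers connect to accounts, accounts to devices
G = nx.Graph()
G.add_edge("customer:alice", "account:123")
G.add_edge("account:123", "device:laptop-1")
G.add_edge("customer:bob", "account:456")
G.add_edge("account:456", "device:laptop-1")  # Bob shares Alice's device

# If Alice is compromised, everything reachable from her is suspect
at_risk = nx.node_connected_component(G, "customer:alice")
print(at_risk)
# {'customer:alice', 'account:123', 'device:laptop-1',
#  'account:456', 'customer:bob'}
```

A relational database can answer the same question with recursive joins, but as the graph grows that quickly becomes painful, which is where the engineering tradeoffs come in.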
Will Faithful, CEO of ExaDev, takes us through the engineering tradeoffs involved in choosing graphs over relational databases, how to process graphs efficiently, and training ML models from graph features.
You can watch the full video here:
Floating-point neural network safety
Ah, IEEE 754, easily among my five favourite technical standards. It’s the specification behind all modern implementations of floating-point numbers, as available from your nearest friendly Python/C/Rust/etc environment. It’s the standard that gives us 0.1 + 0.2 == 0.30000000000000004, and “NaN” (not a number).
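Both quirks are easy to reproduce from a Python prompt:

```python
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False
>>> float("nan") == float("nan")
False  # NaN isn't even equal to itself
```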
Floating-point numbers aren’t bad per se, but they are quite unintuitive, and often misunderstood by the programmers who use them. Sometimes the results are disastrous: a missile defence system failing due to a rounding error, and the loss of the Ariane 5 rocket, to name two.
Neural networks tend to use floating-point weights, but we usually don’t worry about what that means for stability. But imagine we want to use a neural net in a safety-critical application: Edoardo Manino, a researcher at the University of Manchester, explored how feasible that really is, and the current state of the art in tooling for ML model verification.
Check out the full video here:
You and whose data? Lessons in remote SecDevOps
Remember COVID? Masks, lockdowns, video calls… five years on, I think we can all agree it was a strange time.
While a lot of techies took up home working without much difficulty, our last speaker, Thom, was working at the time on a super secret government project. And the data he needed for model training existed in one single physical location, on an air-gapped system, presumably with armed guards outside.
Going to the office was out of the question, thanks to lockdown. So how can you train models from home when you don’t have the data, and worse, you’re not allowed to have that data?
Thom Kirwan-Evans, co-founder at Origami Labs, realised two things: firstly, it’s possible to go a long way with synthetic data. Secondly, having a trained model isn’t the outcome to focus on. What’s more useful is having a well-structured training pipeline that can take a dataset and produce a model, along with metrics, as output. Most of the valuable work can be done remotely, because the value is in the pipeline and the tooling that supports it.
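Here’s a minimal sketch of that idea in scikit-learn, with synthetic data standing in for the real thing. Everything here, from the model choice to the metrics, is illustrative rather than Thom’s actual stack; the point is the shape: dataset in, model and metrics out.

```python
from dataclasses import dataclass

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

@dataclass
class PipelineResult:
    model: RandomForestClassifier
    metrics: dict

def run_pipeline(X, y) -> PipelineResult:
    """Dataset in, trained model and metrics out."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    preds = model.predict(X_test)
    return PipelineResult(
        model=model,
        metrics={
            "accuracy": accuracy_score(y_test, preds),
            "f1": f1_score(y_test, preds),
        },
    )

# While the real data stays behind the air gap, a synthetic stand-in
# with the same shape keeps development and testing moving
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)
print(run_pipeline(X, y).metrics)
```

When the pipeline finally meets the real data, the only thing that changes is the input; everything else has already been built, tested, and rehearsed from home.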
You can watch the full video here:
See you at the next one!
We’re building a community of like-minded people with a passion for production AI/ML here in Manchester. Our next event is on the 5th of June, and you can sign up here.
We’re always looking for more speakers too, so please get in touch if you’ve got a story to tell about MLOps. We’re keen to ensure a diverse set of voices is heard; I’d love to hear from more female speakers and members of minority groups at future events.