Ahoy there 🚢,
Matt Squire here, CTO and co-founder of Fuzzy Labs, and this is the 10th edition of MLOps.WTF, a newsletter where I discuss topics in Machine Learning, AI, and MLOps.
January was a dramatic month, as we witnessed the biggest single-day loss of value by any company in US stock market history, all because of one Chinese AI startup: DeepSeek.
Why the big reaction? Perhaps in part it was fear of impending Chinese market dominance. But mainly, it was because DeepSeek had found a cheaper way to train LLMs, which would mean less demand for AI hardware and less profit for companies like NVIDIA, which lost around $600bn of value almost overnight.
Still, as financial meltdowns go, it could have been worse. I remember when Lehman Brothers declared bankruptcy, ending 158 years of operation. That was the peak of the 2008 financial crisis, which saw a global recession, stock markets crash, and long-established institutions disappear overnight.
Warren Buffett famously told investors to be greedy when others are fearful. And one engineering student at Zhejiang University named Liang Wenfeng (梁文锋) saw opportunity amidst crisis, as he began to think about using machine learning to trade stocks.
It takes a pretty exceptional student to form a team and launch a hedge fund as a side project. Yet what started as a group of friends following their curiosity became, a decade later, Highflyer: a hedge fund managing 10 billion yuan by 2019 (about £1bn / $1.2bn).
The company’s website describes them as “Empowered, Inspired and Enabled by AI”. It includes a demo of their in-house Python-based machine learning platform which is designed for running distributed workloads across thousands of GPUs.
This is where DeepSeek was born. Initially a research lab inside Highflyer, it was spun out in 2023 as DeepSeek, bringing together Highflyer’s expertise and capital and applying it to building world-leading AI.
A brief history of DeepSeek
The DeepSeek team have been prolific in their output and relentless in their innovation.
It started in November 2023, when they released their first model, an ‘ordinary’ LLM with 67 billion parameters. This came with two fine-tuned variants: one for conversations, and another for coding. It wasn’t too remarkable, but it was a proof-of-capability for the newly-formed startup.
Then, in January 2024, the team dipped their toes into Mixture of Experts (MoE) architectures with a 16 billion parameter model. Mistral had brought MoE to open-weight LLMs just one month before, in December 2023, with Mixtral. It's an LLM architecture that allows models to be pre-trained and served with a lot less compute than a dense model of the same size (I explore how it works in detail further down).
DeepSeek-V2 soon followed, in June 2024. This was their biggest so far, with 236 billion parameters. The V2 paper describes the key innovations: they combine MoE with another technique called Multi-head Latent Attention, which is a way to get efficient inference by compressing the Key-Value cache — a crucial component of transformer models that tends to be a bottleneck during inference.
We can see from these releases how DeepSeek have been taking innovative steps towards cost-effective training for some time. In the V2 paper, they claim to have saved 42.5% on training costs and reduced their KV cache memory load by 93.3%, which makes inference cheaper. So why was it DeepSeek-V3, released at the end of December, that triggered January's global market panic? That is to say, why now, and not before?
Looking at V2 and V3 side-by-side, they have a lot in common. Just like its predecessor, V3 is a Mixture-of-Experts model and it uses multi-head latent attention too. They use essentially the same architecture, but V3 is a lot bigger, weighing in at 671B parameters, and it’s been trained on nearly twice as much data, at 14.8 trillion tokens vs V2’s 8.1 trillion.
DeepSeek-V3 outperforms a lot of other models, including Qwen and Llama, and it does this while remaining inexpensive to train. According to the authors, they spent 2.788 million GPU hours on NVIDIA H800 GPUs, which they estimated would cost $5.576M (assuming GPUs rented at $2/hour).
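The headline figure is simply those GPU hours multiplied by the assumed rental rate:

```python
# Back-of-the-envelope check of the reported figure
gpu_hours = 2.788e6      # H800 GPU hours reported in the V3 paper
rate = 2.0               # assumed rental price, USD per GPU hour
print(f"${gpu_hours * rate / 1e6:.3f}M")   # -> $5.576M
```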
By contrast, OpenAI CEO Sam Altman has said that GPT-4 cost more than $100 million to train. And that was on more capable GPUs too; the H800 was manufactured to get around US export restrictions preventing the more powerful H100 from being sold to China.
However, it wasn't V3 alone that propelled DeepSeek to fame. At around the same time, the team released another model trained for reasoning tasks, like logical deduction and mathematics. DeepSeek-R1 is fine-tuned from V3 and performs comparably to the big players, even matching OpenAI's o1. This model is the basis for DeepSeek's chatbot app, which last month surpassed ChatGPT as the most downloaded app in the US iOS App Store.
Which brings us up to the present. Now, let’s take a look at some of the key innovations powering V3 and R1.
DeepSeek-V3’s Mixture of Experts
Mixture of Experts was popularised as an LLM architecture by the Mistral team in December 2023, with the release of Mixtral, as a way of reducing the computation required to train very large models. (MoE itself pre-dates transformer models.)
When we train a massive neural network in the traditional (dense) way, each layer gets evaluated in the forward pass, and each layer needs to be updated when we backpropagate. Pretty much all of the training cost goes into computing and updating weights in this way.
The trick behind MoE is to have a collection of small sub-networks (experts) at each layer, and to only update a few of them at a time. Each expert learns a subset of our total input space: for example, if we have 8 experts in a layer, each training input might be sent to just 2 of the 8, and this way we significantly reduce the amount of work that needs to be done per training step.
For this to work, we also need a mechanism for choosing which experts get which inputs. This is called the router or gate network, and there are a bunch of different techniques out there which we don’t need to get into here. As a bonus, the same trick helps us during inference, because again we only need to run a subset of the full network for any given input.
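To make that concrete, here's a minimal sketch of an MoE layer with top-k routing, written in PyTorch. The class name, layer sizes and expert count are all illustrative choices for this example, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """A feed-forward layer made of several small experts plus a router."""
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)   # the gate network
        self.top_k = top_k

    def forward(self, x):                             # x: (n_tokens, d_model)
        scores = self.router(x)                       # score every expert for every token
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Each token only runs through top_k of the n_experts sub-networks,
# so most expert weights sit idle for any given input.
tokens = torch.randn(4, 512)
print(MoELayer()(tokens).shape)                       # torch.Size([4, 512])
```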
The DeepSeek team have put a lot of work into perfecting MoE. They even designed their own variant of the architecture, DeepSeekMoE, described in a paper of its own. It's a heavy read, but they highlight some key improvements over other implementations. One is better expert specialisation, i.e. making sure each expert acquires non-overlapping and focused knowledge. They also designate a set of shared experts, which are always active and hold common knowledge that every input can draw on during inference.
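Building on the sketch above, the shared-experts idea looks roughly like this; again, a simplification of what the DeepSeekMoE paper actually describes, with made-up sizes.

```python
# Continuing the MoELayer sketch: a couple of "shared" experts run for every
# token, alongside the routed ones. A rough nod to DeepSeekMoE, not their code.
class MoEWithSharedExperts(MoELayer):
    def __init__(self, d_model=512, d_hidden=1024, n_shared=2, **kwargs):
        super().__init__(d_model=d_model, d_hidden=d_hidden, **kwargs)
        self.shared = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_shared)
        ])

    def forward(self, x):
        routed = super().forward(x)                        # specialised, per-token experts
        shared = sum(expert(x) for expert in self.shared)  # always-on common knowledge
        return routed + shared
```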
Attention with many heads
Another significant optimisation is Multi-Head Latent Attention (MLA). In addition to being a mouthful, it’s a technique that makes inference faster and reduces the memory footprint needed to serve a model.
To understand it we first need to understand a little about the attention mechanism that powers transformer models. This mechanism is how a model learns which tokens to pay attention to in any given context. Under the hood, for each input token, the model looks up or queries the other tokens in order to figure out which ones it should be paying attention to (this is an oversimplification, but for our purposes it'll do!).
Multi-head attention is an advancement that allows the model to attend to multiple different parts of the input sequence in parallel, and crucially to learn about different aspects of it, so one head may focus on grammatical structure, another on word meanings, and a third on how high-level concepts in the text are connected.
Calculating all of this is very intensive. However, the keys and values for earlier tokens don't change while we generate a sequence, so instead of recomputing them for every new token, we can cache them: this gives us the KV cache.
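Here's a rough sketch of the KV cache during generation, with a single attention head and made-up dimensions for brevity (real models do this per head and in batches):

```python
# Each new token's keys and values are appended to a cache, so earlier tokens
# never need to be re-projected. Random weights, purely for illustration.
import torch

d_model = 512
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []          # grow by one entry per generated token

def attend(new_token):             # new_token: (d_model,)
    q = new_token @ W_q            # only the newest token needs a fresh query
    k_cache.append(new_token @ W_k)
    v_cache.append(new_token @ W_v)
    K, V = torch.stack(k_cache), torch.stack(v_cache)    # (seq_len, d_model)
    scores = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    return scores @ V              # attention output for the new token

for _ in range(5):                 # pretend we're generating five tokens
    out = attend(torch.randn(d_model))
print(len(k_cache), out.shape)     # 5 torch.Size([512])
```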
So far, none of these ideas are new. What is novel in DeepSeek is the latent attention idea. Essentially, this gives us a compressed representation of keys and values that is shared across attention heads. And that means we can deploy models with a smaller memory footprint.
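Very roughly, the trick is to cache one small latent vector per token instead of full per-head keys and values, and to expand it back at attention time. The dimensions below are invented for illustration, and DeepSeek's real MLA includes details (such as how positional information is handled) that I'm glossing over.

```python
import torch

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

W_down = torch.randn(d_model, d_latent)            # compression, shared across heads
W_up_k = torch.randn(n_heads, d_latent, d_head)    # latent -> per-head keys
W_up_v = torch.randn(n_heads, d_latent, d_head)    # latent -> per-head values

token  = torch.randn(d_model)
latent = token @ W_down                            # (d_latent,): the only thing we cache
k = torch.einsum('l,hld->hd', latent, W_up_k)      # reconstructed keys,   (n_heads, d_head)
v = torch.einsum('l,hld->hd', latent, W_up_v)      # reconstructed values, (n_heads, d_head)

full_cache = 2 * n_heads * d_head                  # floats per token without the latent trick
print(f"cache per token: {full_cache} floats vs {d_latent} with a shared latent")
```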
The R1 secret sauce
The pièce de résistance is DeepSeek’s reasoning model, R1. Using V3 as a base, R1 is fine-tuned to perform reasoning tasks like solving mathematics and coding problems.
This kind of fine-tuning usually relies on having large labelled training sets, i.e. examples of problems alongside good solutions. But DeepSeek accomplished respectable results using reinforcement learning alone, without any labelled data.
The idea is to first let the model solve a problem, and then reward it based on accuracy. Of course, they still need a way to calculate accuracy — for instance, in the case of LeetCode problems they compile the code and run tests in order to evaluate task performance. Ultimately, they found that pure reinforcement learning wasn’t enough to match OpenAI-o1, but adding a small cold-start dataset beforehand allowed them to bridge the gap.
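To make the reward idea concrete, here's a toy sketch of rule-based rewards in the same spirit: an accuracy check plus a small bonus for following the expected output format. The tags and scores are illustrative, not DeepSeek's exact setup.

```python
import re

def accuracy_reward(model_output: str, expected_answer: str) -> float:
    """1.0 if the model's final answer matches the known solution, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match is None:
        return 0.0                 # no parseable answer, no reward
    return 1.0 if match.group(1).strip() == expected_answer.strip() else 0.0

def format_reward(model_output: str) -> float:
    """A small bonus for showing working inside <think> tags."""
    return 0.5 if re.search(r"<think>.*?</think>", model_output, re.DOTALL) else 0.0

sample = "<think>6 * 7 = 42</think><answer>42</answer>"
print(accuracy_reward(sample, "42") + format_reward(sample))   # 1.5
```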
A second insight is what they call the “aha moment”. During training, they encourage the model to allocate more “thinking” time when it needs to do so. The end result is that the model will say something like “Wait, wait. Wait. That’s an aha moment I can flag here.” — which will be familiar to those who have played with DeepSeek chat.
This technique is known as test-time scaling. It turns out that when we train a model to use more time during inference it actually produces better results. At first, that seems surprising, but when we think about it some more*, it makes a lot of sense. After all, pausing to think about something tends to work for humans.
* irony intended
Can we reproduce it?
There’s one more topic to wrap up this deep dive: reproducibility. Because as well as being inexpensive to train, the media has made a big deal of DeepSeek’s models being open source. So, just how open source is it — could we in theory reproduce it?
Disappointingly, DeepSeek haven't actually open sourced everything. The model weights are open and available from HuggingFace. Even better, the checkpoints are published too, so we can see intermediate stages of training. But neither the datasets nor the training code are open source. Contrast this with LLM360, discussed in the previous newsletter, which, while nowhere close to DeepSeek-R1 in capability, does fully commit to open source.
A recent project called Open-R1, launched on the HuggingFace blog, is leading an initiative to build these missing pieces as a community, creating a state-of-the-art frontier model that's truly open source.
They’ve only just started, and they’re looking for contributors too!
Thanks to Danny Wood from Fuzzy Labs for helping fill in the finer details of transformer internals for this newsletter.
And finally
Ever wondered what the Earth’s orbit sounds like?
Each time you move up an octave on a piano (for example), the pitch of any given note doubles. As Dom White shared on Bluesky, if you move down just 32 octaves from middle C#, you get a note with a frequency of one cycle per year, which means that the Earth orbits the sun at a pitch of C#!
Of course, this doesn’t take into account leap years.
About Matt
Matt Squire is a human being, programmer, and tech nerd who likes AI, MLOps and linguistics. He started teaching himself Mandarin around 6 years ago, which has been particularly useful in researching this newsletter. He's the CTO and co-founder of Fuzzy Labs, an MLOps company based in the UK. He wants to use AI for positive impact and is currently immersed in how to make it more energy efficient.
Each edition of the MLOps.WTF newsletter is a deep dive into a certain topic relating to productionising machine learning. If you’d like to suggest a topic, drop us an email!