Ahoy there 🚢,
Matt Squire here, CTO and co-founder of Fuzzy Labs, and this is the 9th edition of MLOps.WTF, a newsletter where I discuss topics in Machine Learning, AI, and MLOps.
Can machines think?
That was the question Alan Turing posed in his 1950 paper Computing Machinery and Intelligence. Written while Turing worked at the University of Manchester, not long after the Second World War, this paper introduced the Turing Test, a thought experiment that asks whether machines that act indistinguishably from humans could be attributed the power of thought.
It’s amazing how important Turing’s 75-year-old paper is to modern AI. Come to think of it, a great deal of foundational work in computer science was done in my home city of Manchester around the same time. That includes the first stored-program computer, nicknamed The Baby; if you’re visiting Manchester, there’s a replica on display at the Museum of Science and Industry.
It’s a bit of a surprise that the UK never became a dominant power in the computing industry, despite having the opportunity. But the government’s announcement last week of an AI Opportunities Action Plan signals a drive to take advantage of the new opportunity in front of us.
The Action Plan seeks to position the UK at the forefront of AI innovation. It includes plans for AI growth zones — I think that just means datacentres — and investment into the skills, data, and infrastructure needed so that the UK can seriously compete on the global stage.
It’s a good sentiment, and for me at least the plan sets the right tone of ambition. There are lots of reasons to justify it too, besides economic power. One is national security: given the significant role AI will play around the world, there’s a clear argument for an independent, homegrown AI sector that gives the UK the freedom to utilise and innovate on this technology well into the future.
The phrase sovereign AI comes up quite a bit in the document, but what does it really take to build a sovereign AI capability?
Other nations are ahead of us here, and that can be seen with the home-grown large language models coming from China, the UAE, and France. These models make good case studies because they touch on all of the major themes in the Action Plan: from the data that goes into them, to the infrastructure needed to train them, and perhaps most importantly the skills needed to make these models successful.
Do we need a sovereign model?
The Action Plan didn’t happen in a vacuum. Conversations about sovereign AI models have been going on in political circles for a couple of years.
In 2022, a memo from Cambridge University to the UK AI Council (now disbanded) advised the government to “take action” to build a sovereign LLM capability, which it claimed would have major consequences for national security. Last February a House of Lords report advocated for home-grown models, and Adrian Joseph, Chief Data and AI Officer at British Telecom, went to Parliament to present his argument for why the UK needs its own LLM. He said that the AI arms race has already begun, and that without government support the UK risks being left behind, unable to ever catch up with the rest of the world.
Meanwhile, home-grown LLMs have been appearing from all over the world; everywhere but here.
Mistral, founded in France in 2023, now produces some of the most widely adopted open source models in industry. In the strategic memo that won them €105 million in funding, Mistral highlighted US market dominance at the time, and expressed their hope to become a “European leader in productivity” and to “guide the new industrial revolution that is coming”.
China’s Qwen model (or 通义千问, which translates to something like “Understanding 1000 questions”), built by Alibaba Cloud, has been attracting a lot of attention recently, particularly due to its impressive reasoning capabilities. Meanwhile, the UAE have Falcon, an open source model built by the Technology Innovation Institute in Abu Dhabi, claimed to outperform Llama 3.
There’s another argument for homegrown LLMs that has less to do with international competition than with openness and trust. Even among models that are advertised as open source, like Mistral or Falcon, the important things, such as training data, code, and intermediate training steps, remain proprietary. This not only leads us to doubt the security and safety of those models (because we can’t independently scrutinise them), it also hinders the ability of the broader AI community to study, replicate, and innovate on the most successful models.
How to build an LLM
One project is looking to change all of that. LLM360 is a community-driven effort to build a truly open source LLM. In their paper Towards Fully Transparent Open-Source LLMs, the founders criticise the industry for not being open about training, fine-tuning, and evaluation processes. They advocate for sharing all training code, data, checkpoints, and intermediate results.
They’ve released two 7B parameter models: Amber, trained on English text, plus a coding model called CrystalCoder. These give us enough information to piece together our own LLM from scratch.
Unsurprisingly, data is the starting point. Most of the training data comes from Common Crawl, a free, open repository containing 250 billion web pages spanning 17 years (reminding us that what you put on the Internet will be around forever). More specifically, it’s a combination of RefinedWeb, a cleaned-up version of Common Crawl developed by the team behind Falcon, plus elements of RedPajama, a 30-trillion token dataset for large language model training. For CrystalCoder, they add the StarCoder data.
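If you want a feel for what this data actually looks like, it’s easy to peek at it. Here’s a minimal sketch, assuming the RefinedWeb dataset id and content field as TII publishes them on the Hugging Face Hub (this isn’t part of LLM360’s pipeline, just a way to poke around):

```python
# Stream a few records from RefinedWeb without downloading the whole
# (multi-terabyte) dataset.
from datasets import load_dataset

refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, record in enumerate(refinedweb):
    print(record["content"][:200])  # the cleaned web-page text
    if i >= 2:
        break
```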
They also share their data preparation code, which downloads all of the original data, tokenises it, and groups the tokens into sequences ready for training. All of the data is then split into 360 chunks, with token sequences distributed evenly across them; the idea is that we can train the model iteratively, chunk by chunk, and collect a checkpoint for each.
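Here’s a rough sketch of that chunking idea. To be clear, this is my own illustration rather than LLM360’s actual code, and I’m assuming the LLaMA tokenizer and a 2048-token sequence length, which matches Amber’s setup:

```python
# Illustrative sketch: tokenise documents, pack the tokens into fixed-length
# sequences, and deal the sequences round-robin into 360 chunks.
from transformers import AutoTokenizer

NUM_CHUNKS = 360
SEQ_LEN = 2048  # Amber trains on 2048-token sequences

# Assumes an ungated copy of the LLaMA tokenizer on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

def pack_into_chunks(documents):
    chunks = [[] for _ in range(NUM_CHUNKS)]
    buffer, seq_index = [], 0
    for doc in documents:
        buffer.extend(tokenizer(doc)["input_ids"])
        # Whenever we have a full sequence, deal it to the next chunk
        while len(buffer) >= SEQ_LEN:
            sequence, buffer = buffer[:SEQ_LEN], buffer[SEQ_LEN:]
            chunks[seq_index % NUM_CHUNKS].append(sequence)
            seq_index += 1
    return chunks
```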
Next, we need a model architecture. Think of this as our blank network which we’re going to train. LLM360 uses the same model architecture as LLaMA 7B, and it’s worth clarifying that this is not the same as using the pre-trained LLaMA weights. LLM360 trains its own weights, but it borrows the underlying network architecture from LLaMA (code here).
Before we can train, we need to configure the network, choosing things like the number of hidden layers and attention heads, as well as training parameters like the learning rate. Typically, this is where we would do hyperparameter tuning in order to select ideal values, but LLM360 shortcuts that by choosing hyperparameters based on what LLaMA uses (or is presumed to use).
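To make that concrete, here’s roughly what a LLaMA-7B-shaped blank network looks like in code. I’m using Hugging Face’s transformers library here, which isn’t necessarily LLM360’s training stack, but the hyperparameters below are the published LLaMA 7B values:

```python
# A LLaMA-7B-shaped "blank network": randomly initialised weights,
# borrowed architecture. Hyperparameters follow the published LLaMA 7B setup.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=4096,           # width of each transformer layer
    intermediate_size=11008,    # feed-forward inner dimension
    num_hidden_layers=32,
    num_attention_heads=32,
    max_position_embeddings=2048,
)

# Beware: instantiating 7B random weights needs ~28 GB of RAM
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```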
Finally, the models are trained on an in-house cluster of 56 DGX A100 nodes. Each node has 4× A100 GPUs with 80 GB of memory each, so that’s 17,920 GB in total.
A number of models are fine-tuned from the base model. For instance, AmberChat, tuned for instruction following, is trained using the WizardLM dataset, which contains 70,000 examples of instructions and expected outputs for the model to learn from. Fine-tuning takes far fewer resources than the original training: AmberChat is trained for 3 epochs on ‘just’ 8 A100s.
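For a flavour of what that step involves, here’s a minimal fine-tuning sketch. The tooling choices are mine rather than LLM360’s actual recipe, and I’m assuming the Amber model and WizardLM dataset identifiers as they appear on the Hugging Face Hub:

```python
# Minimal instruction-tuning sketch: train a base model for 3 epochs on
# WizardLM-style (instruction, output) pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "LLM360/Amber"  # base model id on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# ~70k instruction/output pairs from WizardLM
data = load_dataset("WizardLM/WizardLM_evol_instruct_70k", split="train")

def to_features(example):
    # A simple prompt template; AmberChat's exact template may differ
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}")
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="amberchat-sft", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```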
Sadly I don’t have access to 56 A100s (or even one!), but some day soon this scale of hardware investment might not be necessary at all. At the end of 2024, Chinese AI startup DeepSeek announced an 11X reduction in the GPU resources needed to train a model that’s comparable with those of the other big players.
Their 671B parameter Mixture-of-Experts model was trained with 2.8 million GPU hours. To put that into perspective, Meta spent about 30 million GPU hours to train Llama 3, which has fewer parameters too (405B).
They still needed a lot of hardware to do it: 2,048 Nvidia H800s, specifically. It’s hard to compare directly with LLM360, partly because the hardware choices are different, but also because DeepSeek’s model is two orders of magnitude bigger than LLM360’s.
In any case, the state of the art is trending cheaper, and that lowers the barrier to entry for new players in this space.
Back to Manchester
It’s 75 years since Turing asked whether machines can think, and the UK has a renewed opportunity to be a world leader in the very industry that question spawned. I can’t help feeling that Manchester has a special role to play in shaping that.
It’s great to have ambition. But the government should also be asking how we can differentiate ourselves; what makes us stand out?
First, we should champion true open source AI. This is important for so many reasons: it drives public trust, and encourages collaboration across different communities. It helps to uncover biases and supports better data diversity. Crucially, open source fosters learning and innovation.
Secondly, all these AI growth zones are going to need power, and lots of it. The Action Plan does acknowledge that growth zones need to be sustainable, but it doesn’t say how we do that. Energy-efficient AI is something we’ve been looking into closely at Fuzzy Labs (read more here), because we see it as a crucial problem to solve.
One final thought is that we need to think bigger about the future. LLMs are already established technology, and as we’ve seen with projects like DeepSeek’s, the barrier to entry for building them is getting smaller; sooner or later everybody and their dog will have their own LLM.
In a few years, there’ll be something completely new in AI. I don’t know what it will be, but it won’t be an LLM. If we’re successful at building a sovereign AI capability, maybe it will even be invented here. In any case, we need to be ready to move and adapt quickly.
And finally
We saw in How to build an LLM that LLM360 uses the LLaMA architecture as its ‘blank network’, which gets ‘filled in’ during training.
I can’t help thinking of a definitely fake story about neural network pioneer Marvin Minsky. Back when the Internet was new, there was a webpage dedicated to “hacker folklore”. Written like a Zen koan, the story goes like this…
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
“What are you doing?”, asked Minsky.
“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.
“Why is the net wired randomly?”, asked Minsky.
“I do not want it to have any preconceptions of how to play”, Sussman said.
Minsky then shut his eyes.
“Why do you close your eyes?”, Sussman asked his teacher.
“So that the room will be empty.”
At that moment, Sussman was enlightened.
(Referring to Gerald Sussman and Marvin Minsky, of course.)
Thanks for reading!
Matt
About Matt
Matt Squire is a human being, programmer, and tech nerd who likes AI, MLOps and salt & vinegar crisps that taste of salt & vinegar. He’s the CTO and co-founder of Fuzzy Labs, an MLOps company based in the UK. He wants to use AI for positive impact and is currently immersed in how to make it more energy efficient.
Each edition of the MLOps.WTF newsletter is a deep dive into a certain topic relating to productionising machine learning. If you’d like to suggest a topic, drop us an email!