Matt Squire is away on his honeymoon, so this edition of MLOps.WTF is brought to you by Danny Wood, Senior MLOps Engineer at Fuzzy Labs.
In the late 1800s, there were two competing methods for transmitting electricity over power lines: direct current, the low-voltage approach championed by Thomas Edison, and the high-voltage alternating current advocated by George Westinghouse. Years of fierce competition and bitter dispute followed, with Edison's vision of city streets lit by incandescent bulbs running on direct current fighting it out against Westinghouse's plans for arc lamps powered by alternating current.
While alternating current eventually won out, arc lamps didn't: by the early 20th century, streets were increasingly lit by incandescent bulbs running on power grids that were primarily alternating current. Ultimately, the winning solution took the best parts of both worlds.
An Example of an Arc Lamp
Over the past few years, we have been seeing a similar thing play out with TensorFlow and PyTorch, the machine learning libraries from Google and Meta, respectively. While both frameworks have their own strengths and weaknesses, they have also both borrowed liberally from each other in their quest for market share.
And while we’re far from being able to declare one framework or the other as having won out, they have gone from approximately level pegging in terms of popularity, to PyTorch taking a noticeable lead.
At the time of writing, PyTorch is nearly twice as popular according to Google Trends, and has been downloaded 35 million times from PyPI in the last 30 days, versus TensorFlow’s 19 million. PyTorch has also hugely benefited from being the default backend for Hugging Face’s Transformers library and being a key component in models from new big players like Anthropic, Mistral and Stability AI.
This doesn’t mean that PyTorch is objectively better, or that TensorFlow will fade into irrelevance, but the community feels like it’s picked a side in a way that will be hard for Google’s TensorFlow to reverse. Was this inevitable?
The early days of deep learning
In the early days of deep learning, not only was it not clear what framework would win out, it wasn’t even clear what language the future of machine learning would be written in. While there was nothing as fully-featured as today’s options, there were a plethora of frameworks out there. The original Lua-based version of Torch had been around since 2002; libraries such as Caffe and MXNet offered multiple language bindings; and before the release of TensorFlow, it was Theano flying the flag for Python.
None of these had the usability of modern frameworks, though being written in Python gave Theano an advantage. The language was becoming increasingly popular, and had a reputation for being easy enough for beginners to pick up, but powerful enough to do everything that machine learning researchers needed.
Built on top of Theano, there was also Pylearn2. Theano primarily offered an engine for performing differentiable computation, while Pylearn2 (amongst other tools like Lasagne and Blocks) sat on top of it to provide the functions and modules required to easily build neural networks. Pylearn2 was described by its maintainer Ian Goodfellow as architecturally being the “Habitat 67 of machine learning libraries”, and you didn’t have to stray far from the intended use cases to understand what he was talking about.
Habitat 67 in Montreal: At least its design is very modular!
Google clearly saw promise in Theano; when they came to build their own internal framework, Theano seems to have been a major influence. This framework became the original TensorFlow and was eventually released to the public in 2015. While Theano had primarily been an engine for writing differentiable programs, TensorFlow quickly integrated a lot of the best features from the likes of Pylearn2 as well, making it easier than ever before to start training deep learning models.
The first version of PyTorch was released nearly a year later, with Facebook moving away from its Lua-based predecessor to Python. The race for market share had begun.
Different paradigms
As might be expected from two frameworks with such disparate origins, the two systems worked very differently.
TensorFlow had a "lazy evaluation" approach that made it very clear to the user that the models they were building did not really live in Python: the user would add nodes to a computational graph that described the shape of their network, then send that graph off to the GPU, along with any requisite data, in order to get an output.
On the other hand, PyTorch's "eager evaluation" was much more intuitive to users of MATLAB or NumPy. When you wrote c = a + b, the value of c would be ready for you to print out straight away; you didn't need to worry about the fact that you were actually manipulating a computational graph.
TensorFlow's approach had the advantage of more formally defining the computation, and also meant that the GPU never sat idle waiting for slow Python code before proceeding to the next step. However, it was also less intuitive to pick up and play with. This was reflected in initial perceptions of the use cases of the two tools: TensorFlow was largely seen as more suitable for writing robust, production-ready code, while PyTorch was there for scrappy, fast iteration and experimentation.
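The difference can be sketched in a few lines of plain Python. This is purely illustrative, not the real API of either framework: the toy Node class stands in for the graph-building, lazy style, while ordinary Python arithmetic stands in for the eager style.

```python
# Illustrative sketch only -- not TensorFlow's or PyTorch's actual API.

# "Lazy" / graph style: writing an expression only records a graph node;
# nothing is computed until the graph is explicitly run with data.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def __add__(self, other):
        return Node("add", self, other)

    def run(self, values):
        if self.op == "const":
            return values[self.inputs[0]]  # look up a named input
        if self.op == "add":
            return sum(n.run(values) for n in self.inputs)

a = Node("const", "a")
b = Node("const", "b")
c = a + b                       # c is a graph node, not a number
print(c.run({"a": 2, "b": 3}))  # only now is the value 5 computed

# "Eager" style: plain values, computed immediately.
a, b = 2, 3
c = a + b   # c is 5 right away, ready to print or inspect
print(c)
```

In the lazy style, inspecting an intermediate value means running the graph; in the eager style, every intermediate value is just there, which is a big part of why it feels more natural coming from NumPy.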
Convergence
With TensorFlow 2, Google changed the default behaviour of TensorFlow: it was now eager by default. No more defining computational graphs and politely asking to see variables after execution; you could interact with things live, just as you did in PyTorch. Facebook, meanwhile, made efforts to bring PyTorch closer to production readiness, such as allowing PyTorch functions to be compiled to minimise inference time.
This is speculation on my part, but I think that PyTorch is winning for much the same reason that Python won: it’s easy enough for beginners to pick up whilst being powerful enough to do everything you need. While TensorFlow had the capability to span multiple GPUs and had a dedicated serving library from the start, the vast majority of users will be drawn to the tool with the lowest barrier to entry, and PyTorch still has the reputation for being that tool.
I don’t think that the current state of the ecosystem, with PyTorch dominating, was inevitable. Perhaps if Hugging Face had chosen to build their libraries around TensorFlow, the situation would have been reversed. But if TensorFlow were dominating, it would be through mimicking a lot of the things that people like about PyTorch.
At the same time, PyTorch has only been able to achieve its current position by copying TensorFlow’s tools and features around scaling and deployment.
Much like with the war of the currents in the late 1800s, it’s not a case of one paradigm winning outright, but instead convergence to a solution that combines the best parts of multiple approaches.
And Finally
This year’s Nobel prizes in physics and chemistry have caused a bit of a stir, with the winners being far more strongly associated with machine learning than the fields for which their prizes were awarded. This got us interested in looking into a few more previous winners who are famous for other things:
Luis Alvarez, who won the 1968 physics prize for his work on resonance states, went on to work with his son to develop the asteroid impact theory, the current leading explanation for the dinosaurs' extinction
Andre Geim, who shared the 2010 prize for the discovery of graphene, is arguably equally well known for his frog levitation experiments
And of course, Marie Curie is famous as the only person to have won Nobel Prizes in two different sciences, one in physics and one in chemistry
About Matt
Matt Squire is a programmer and tech nerd who likes AI and MLOps. Matt enjoys unusual programming languages, dabbling with hardware, and giving full editorial control of his newsletters to the best MLOps Engineers he can find. He's the CTO and co-founder of Fuzzy Labs, an MLOps company based in the UK. Fuzzy Labs are currently hiring, so if you like what you read here and want to get involved in this kind of thing, check out the available roles here.
Each edition of the MLOps.WTF newsletter is a deep dive into a certain topic relating to productionising machine learning. If you’d like to suggest a topic, drop us an email!