AI

The divide over open-source AI, explained

A debate over the availability of the code and data behind AI could shape the future of the technology.

Anna Kim

· 10 min read

The rift that has emerged in the tech world over the concept of open-source in AI can seem anything but straightforward. The leading player on the “closed” side of this split is called OpenAI. Companies that do claim to be open-source are often accused of “openwashing.” And Mark Zuckerberg is cool now?

If you’re confused by this whole situation, you’re in good company. The question of how public the code and data behind the latest wave of AI should be has big implications for the future of the tech, yet the industry doesn’t even quite agree on what open-source AI is in the first place.

The Open Source Initiative (OSI), the organization widely seen as responsible for setting that standard, recently embarked on a road trip of sorts to ask around about its proposed definitions. Executive Director Stefano Maffulli told us the complexities of the tech and the pace of its development pose unprecedented challenges.

“This is absolutely never seen before. Not only because of the new artifacts like model weights [and] parameters,” Maffulli told Tech Brew. “Also, the speed at which it came out of the labs. Like, all of a sudden, ‘Look, this thing thinks! Oh, let me build an app!’ I’ve never seen anything like this.”

The lack of a commonly agreed-upon definition hasn’t stopped companies and developers from taking sides. One faction, which includes Meta and IBM, has framed itself as a champion of open-source collaboration and innovation, though some critics have taken issue with Meta’s use of the term.

Meanwhile, OpenAI, Anthropic, and others have kept their systems more proprietary. Proponents of this approach claim secrecy is necessary because open-source AI can be dangerous in the wrong hands.

Then there’s Google and Microsoft, which have mostly taken the closed route but have also made slight overtures to the open-source movement, perhaps hedging their bets.

Before we get into all that, though, let’s back up.

What is open-source software?

The philosophy of free and collaborative software has existed almost as long as software itself, according to the OSI. But the term “open-source” was coined in 1998 in a bid to take the idea mainstream, pegged to Netscape’s release of its flagship browser’s source code, a move that gave birth to what was then called the Mozilla Organization.

The idea behind open-source software licenses is to give developers access to study, modify, and build on the code underpinning a given piece of software.

How does that apply to AI?

But that’s a bit trickier when it comes to complex LLMs: replicating one requires huge troves of training data, the architecture code for its layered neural networks, and billions of parameters determined through training. In the AI world, the question at the heart of the debate over the definition of “open-source” is this: How many of those ingredients need to be made available, and what restrictions are allowable?

One of the key points of debate is dataset availability. Maffulli said the OSI tasked groups of developers around the world with determining which components they need access to in order to modify AI models. The groups reported that they didn’t necessarily need all of the training data; they could often make do with detailed information about the contents of a given dataset, a finding reflected in the org’s current draft definition, according to Maffulli.

“They were ranking a lot higher [in importance] the information, instructions, and tools used to build the datasets,” Maffulli said. “They said it’s much more valuable for me to see…where they got the data, what scope, what they were thinking, because that helps me build a similar dataset to improve on top of that one, without necessarily having access to the original dataset.”
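What might that kind of dataset documentation look like in practice? Here’s a hypothetical sketch, written as a Python dict; every field name and value below is invented for illustration, not drawn from the OSI’s draft or any real dataset.

```python
# A hypothetical "dataset card" summarizing how a training set was built.
# All fields and values here are invented for illustration; real
# documentation efforts (e.g., datasheets for datasets) vary in structure.
dataset_card = {
    "name": "example-web-corpus",  # hypothetical dataset name
    "sources": ["web crawl snapshots", "public-domain books"],
    "size": "1.2 trillion tokens",
    "languages": {"en": 0.78, "es": 0.06, "other": 0.16},
    "filtering_steps": [
        "near-duplicate removal",
        "quality-classifier threshold applied",
        "pages disallowed by robots.txt excluded",
    ],
    "known_gaps": "underrepresents non-Latin scripts",
}

# A developer could use a record like this to assemble a comparable
# dataset without ever seeing the original data.
```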

What’s the history of open-source in AI?

Many of the advancements in AI have emerged with support from open-source systems. Open-source frameworks like PyTorch, Google’s TensorFlow, and Keras have been essential to much of AI and machine learning development.

OpenAI was founded as a nonprofit lab with an open ethos in 2015. But the company has released successively less information about each of its flagship models since 2019, when it announced a gradual rollout of GPT-2 out of fear that the model was too powerful to release outright. OpenAI switched to a hybrid for-profit structure and attracted a $1 billion investment from Microsoft in the months after that announcement.

Early LLMs like Google’s BERT (2018), the Allen Institute’s ELMo (2018), and Baidu’s ERNIE (2019) were all open-sourced for research purposes.

But as these types of systems have become larger and more commercialized in the last few years, many tech giants have shifted to an API approach, where users only have access to an interface that provides outputs of the system and none of the underlying code that governs how it operates.
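To make the contrast concrete, here’s a minimal sketch of the two access models in Python. The model and library names are real, but the details are illustrative, not a definitive implementation.

```python
# Closed, API-only access: the prompt goes to a hosted endpoint and text
# comes back; the weights and code stay on the provider's servers.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain open-source AI."}],
)
print(response.choices[0].message.content)

# Open-weight access: the parameters themselves are downloaded and run
# locally, where they can be inspected or fine-tuned. (This particular
# repo is gated; Meta approves access on Hugging Face.)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
```

In the first case, nothing about how the model works ever leaves the provider; in the second, the weights sit on the developer’s own machine.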

Why does it matter?

Even at its most open—when the massive training datasets, the architecture code behind the sets of neural networks, and the resultant parameters are made available—AI is somewhat unknowable by its very nature.

Neural networks are essentially mind-bogglingly huge equations full of nodes that transform an input into an output based on numerical values, or weights, determined through training. LLMs consist of layers of these networks, with up to 1 trillion parameters shaping the output. So it’s more or less impossible to look at a prompt and see exactly why the model spit out the string of words it did, which is why such a model is often referred to as a black-box system.
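A toy-scale sketch helps show why: even a tiny network of 16 weights gives no obvious account of how input becomes output, and production LLMs scale the same mechanism up to billions of weights. The code below is purely illustrative.

```python
import numpy as np

# A toy two-layer network: each layer is just a grid of weights, learned
# during training, applied to the input with a nonlinearity in between.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # layer 1: 12 weights
W2 = rng.normal(size=(1, 4))   # layer 2: 4 weights

def tiny_model(x):
    hidden = np.tanh(W1 @ x)   # transform input into intermediate "nodes"
    return W2 @ hidden         # combine nodes into the final output

print(tiny_model(np.array([0.5, -1.0, 2.0])))
# Even here, "why" the output is what it is reduces to 16 opaque numbers;
# an LLM's answer reduces to hundreds of billions of them.
```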

That’s why many open-source proponents say that having access to the training and operation components of these systems is the only way of beginning to understand why they work the way they do.


Ayah Bdeir, a Mozilla senior advisor who leads the Mozilla Foundation’s AI strategy and collaborations to drive open and trustworthy AI, said that openness about training and how models operate is a step toward understanding the risks of these types of systems.

“If you have all these black-box systems that consumers or regulators or activists or researchers or academics cannot look under the hood of, how do we know whether they are behaving properly or not?” Bdeir told Tech Brew. “How do we know whether they are biased? How do we know whether they are promoting the kind of proprietary solutions that are meant to make profit for the company to create them?”

Are there risks in open-source AI?

Not everyone agrees with the push for more openness. There’s also a camp that argues that open-source AI will allow bad actors to more easily disable built-in safeguards and use AI for nefarious purposes. In an opinion piece published in January in the IEEE’s Spectrum publication, David Evan Harris, a chancellor’s public scholar at UC Berkeley, argued that open-source AI could be tapped to create bioweapons or bombs unless certain regulations are enacted.

“The threat posed by unsecured AI systems lies in the ease of misuse,” Harris wrote. “They are particularly dangerous in the hands of sophisticated threat actors, who could easily download the original versions of these AI systems and disable their safety features.”

On the other hand, a paper published in February by Stanford University’s Institute for Human-Centered Artificial Intelligence evaluated the benefits and marginal risks of open foundation models. In many of the cases analyzed, the paper’s authors found a lack of evidence that open models posed additional risk.

“Overall, we are optimistic that open-foundation models can contribute to a vibrant AI ecosystem, but realizing this vision will require significant action from many stakeholders,” the authors wrote.

Which companies are truly open?

Mark Zuckerberg is one of the leading Big Tech voices pushing for a more open approach, routinely shouting out its innovation benefits on earnings calls. Meta’s Llama family of models makes its weights, the parameters that result from the training process, available for free.

Some critics have taken issue with Meta’s description of this availability as “open-source,” noting that there isn’t an agreed-upon definition of that term as it relates to AI. Llama also comes with license restrictions on commercial use, which some argue disqualify it from being open-source. Those restrictions include a requirement that developers request a license from Meta if Llama is used in a product or service with more than 700 million monthly active users, as well as a ban on training other models with Llama’s output. The OSI also takes issue with Meta’s acceptable use policy, which prohibits uses involving things like “critical infrastructure” and “regulated/controlled substances.”

“Llama is one of the biggest offenders because, in their press release, they call Llama ‘open-source.’ And that’s absolutely not fair,” Maffulli said. “Other companies have been a little bit more cautious. But at the same time, the press have picked up the moniker of open-source, and they’ve been slapping it in regardless of what the original announcement from that company was.”

As of publication time, Meta had not provided comment.

IBM has also aligned with the more open-leaning side of the debate, regularly talking up the benefits of open-source. Together with Meta, the company spearheaded the formation of a group in December called the AI Alliance, which pushes for a more open approach to AI development. Other members include Databricks, Dell, Hugging Face, Intel, and Stability AI.

University of Illinois electrical and computer engineering professor Deming Chen, who’s helping to lead the AI Alliance, said the group is striving toward a more “balanced, transparent” AI space.

“People contribute open-source datasets and open-source models and the goal is to grow the community and benefit the community,” Chen told Tech Brew. “But of course, try to control the risk, and make all the AI solutions more reliable, more trustworthy.”

Which are not as open?

OpenAI and Anthropic have been notably protective of their model weights, often arguing that the code could do damage in the wrong hands. Given Microsoft’s reported $13 billion stake in the former and Amazon’s $4 billion stake in the latter, the Big Tech behemoths also seem to be somewhat aligned with that approach.

However, Microsoft recently made a much smaller investment in Mistral, a French startup known for its open-leaning approach, and both tech giants host various open models on their respective cloud platforms.

Google’s flagship Gemini models are accessible only through a proprietary API or direct interface, but the company recently released a smaller open-weight model called Gemma designed to compete directly with Llama. In a memo leaked last year, a senior software engineer at the company fretted that open-source AI could “eat our lunch.”

What’s next?

The OSI’s roadshow isn’t the only effort to hammer out a comprehensive definition of open-source AI. Bdeir and Mozilla published a framework last month in collaboration with the Columbia Institute of Global Politics, and the Linux Foundation rolled out a document called the Model Openness Framework this year.

As governments around the world push toward more regulation of AI, they have also grappled with a definition. The Biden administration, for instance, sought comment in February on the risks of open-weight systems.

“The most important thing is you make sure that you are getting balanced and diverse advice and a cohort of people around you that represents the ecosystem at large and not the major players only,” Bdeir said.
