A key group in the open-source community is taking a step toward setting the terms of a debate that’s roiled the AI space.
The Open Source Initiative (OSI), the organization widely seen as responsible for arbitrating openness standards, published the latest version of its definition of open-source AI on Thursday. The document comes after months of consulting with various developers, academics, and other concerned parties on a roadshow of workshops around the world.
While more of those tour stops are still to come, Ayah Bdeir, a senior advisor on AI strategy at the Mozilla Foundation who played a role in the process, said she doesn’t expect the definition to change much before the “stable version” is presented in the fall.
What’s at stake? The question of how open the components of generative AI models should be has split the tech industry.
On one side, companies like Meta and IBM advocate for certain degrees of openness, while the likes of OpenAI and Anthropic warn it could be dangerous. The rift between advocates of openness and proponents of a more closed approach to AI is further complicated by the lack of an agreed-upon definition of open-source AI. Unlike conventional software, AI is harder to pin down: what’s needed to replicate or modify a system is open to debate and can depend on the particular use case or model.
The new definition says that two of the three broad ingredients for an AI model—the source code needed to train and run the system and the model weights and parameters—should be freely available through OSI-approved licenses.
The third piece of the definition, covering the training data, is more complicated; it holds that open-source AI must include information about training data detailed enough that “a skilled person can recreate a substantially equivalent system using the same or similar data.” The definition suggests that this information about the training data—though not the data itself—should also be distributed through open-source licenses.
The data requirement was the most “hotly debated,” according to Bdeir. While open-source purists lobbied for full dataset release, the OSI recognized that doing so would raise copyright and other concerns for developers, Bdeir said.
“How can the maker of an AI system distribute data that maybe they don’t have distribution rights to?” Bdeir said. “Opening up all the data could set them up for legal risk.”
She hopes this definition will bring some much-needed clarity to the debate.
“The bigger goal of this definition is to really put a fine point on who’s calling themselves open-source and who’s not,” Bdeir said. “Having some rigor around that definition allows some of the regulations that are upcoming to be able to point to an objective metric that can measure that process and not have some hand-wavy ability to slap a label on your AI system.”