Why Databricks tapped its entire workforce to write training data for a ChatGPT rival

The startup claims Dolly 2.0 is the first open-source model of its kind for commercial use.

Enterprise tech startup Databricks recently gave its 5,000 employees an unusual task: Write up random bits of question-and-answer dialogue, text summaries, creative prose, and anything else that might be useful for an AI to know about the world.

The goal was to collect enough data to train a machine learning model that could serve as a smaller, less expensive alternative to OpenAI’s enterprise-level ChatGPT, one that businesses could use to build their own AI language programs.

“We’re kind of doing a reverse Jeopardy,” Databricks CEO Ali Ghodsi told Tech Brew. “We’re trying to show the model how to behave.”

Databricks released the result last week in the form of Dolly 2.0, which the company claims is the first open-source instruction-following large language model (LLM) trained on human-generated data that’s available for commercial use.

By making the model open-source, the goal is to give businesses the ability to build and own affordable AI language models without third parties or API fees. It can be used for enterprise tasks ranging from aspects of drug discovery to insurance underwriting, Ghodsi said.
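To make that concrete, here’s a minimal sketch of what skipping API fees can look like in practice, assuming the model’s weights are published under a Hugging Face identifier such as databricks/dolly-v2-12b (a detail this article doesn’t confirm): a team could download the weights once and run them on its own hardware with the open-source transformers library, with no per-call charges and no prompts leaving its infrastructure.

```python
# Hedged sketch: running an open-source instruction-following model locally.
# The model identifier and pipeline behavior below are assumptions, not
# details confirmed in this article.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-12b",  # assumed Hugging Face model ID
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,           # the repo ships its own generation pipeline
    device_map="auto",
)

# No API key, no per-request fee; inference runs on your own machines.
print(generate("Summarize the coverage exclusions in this insurance policy: ..."))
```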

The companywide data creation process was key to making that happen. Many previous open-source ChatGPT competitors—including the first version of Dolly, released last month—are trained on output from ChatGPT itself and thus barred from commercial use by OpenAI’s terms and conditions, which forbid the system’s data from being used to train models that compete with OpenAI.

Word games

By tapping Databricks’ workforce, the company found a way to quickly build training data without the use of ChatGPT. Ghodsi said management created a process that incentivized workers to submit examples of dialogue and tweak those of their peers in a “gamified” manner. The exercise ultimately produced 15,000 prompt-and-response pairs, he said.
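To picture what those pairs look like, here’s a hedged sketch of inspecting the released data, assuming it’s published on the Hugging Face Hub under a name like databricks/databricks-dolly-15k (which this article doesn’t mention) with the field layout shown in the comments.

```python
# Hedged sketch: peeking at the crowd-written instruction data. The dataset
# name and record fields are assumptions, not details stated in this article.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly))  # on the order of 15,000 records
print(dolly[0])    # each record pairs an instruction with a human-written
                   # response, plus optional context and a category label
```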

There are risks to training AI this way. For one, Ghodsi acknowledged that his workforce isn’t fully representative of the global population. He contends the resulting data could still be suitable for the types of enterprise use cases imagined for Dolly, and he hopes that more sample data will be added as others adopt the model for their own uses.

“Whenever you have a sample, you have a sample bias,” Ghodsi said. “This particular dataset is close to 5,000 employees across 40 countries and many different functions. But it’s also a tech company and mostly male employees—very few doctors, very few musicians, and so on and so forth.”

While OpenAI’s models have certain guardrails in place intended to prevent the AI from veering into sensitive topics, Dolly’s documentation warns that the model may produce “objectionable content.” That’s because the foundation for Dolly is derived from a massive open-source, web-scraped dataset called “The Pile” that isn’t filtered for offensive content. Databricks used the data produced by the employees to train this base model through a process that AI practitioners call fine-tuning.
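In other words, the recipe is: start from a base model pretrained on The Pile, then fine-tune it on the employee-written pairs so it learns to follow instructions. A hedged sketch of that supervised fine-tuning step might look like the following, with an assumed smaller stand-in base checkpoint, dataset name, and prompt template rather than Databricks’ published recipe.

```python
# Hedged sketch of instruction fine-tuning: a base model pretrained on The Pile
# is further trained on instruction/response pairs. The base checkpoint,
# dataset name, and prompt template are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-2.8b"  # assumed smaller stand-in for a 12B base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # the base tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

def to_text(record):
    # Render one instruction/response pair as a single training string.
    return {"text": (f"### Instruction:\n{record['instruction']}\n\n"
                     f"### Response:\n{record['response']}{tokenizer.eos_token}")}

dataset = (
    load_dataset("databricks/databricks-dolly-15k", split="train")
    .map(to_text)
    .map(lambda r: tokenizer(r["text"], truncation=True, max_length=1024),
         remove_columns=["instruction", "context", "response", "category", "text"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-style-finetune",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```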

Ghodsi claimed such guardrails are less of a concern for Dolly’s users, who are more likely to be focused on enterprise applications.

“Most of our customers actually have very specific use cases that are enterprise use cases,” Ghodsi said. “The use cases for an enterprise setting are very different from the kind of consumer application that ChatGPT is, where you put it online and anyone can use it.”

Businesses beware

Constellation Research VP and Principal Analyst Andy Thurai said the model’s relatively small parameter count—12 billion versus GPT-3’s 175 billion—will limit its versatility. He also noted that business users should put their own processes in place to guard against potentially obscene AI output.

“Because of the smaller nature of the parameters and the training set, the responses can be rude, short, toxic, and offensive to some,” Thurai said in an email.

Given these limitations, Thurai advises that only companies with a solid foundation in machine learning attempt to turn Dolly into an AI tool of their own.

“This is a good first step for any enterprise which has a solid MLOps process in place and knows how to create, customize, manage, retrain, and govern ML models on their own,” Thurai said. “For someone who is just starting out, this is a very dangerous experiment to try, which can cost them more in the long run.”

Dolly’s introduction comes as tech giants like Google, Microsoft, and Meta have been jockeying to turn LLMs into business products since Microsoft-backed OpenAI’s ChatGPT pushed the technology into the mainstream late last year.

While smaller than many of its rivals in the space, Databricks sports a reported valuation of around $31 billion as of last fall, according to The Information, and claims to serve as a data storage provider for more than half of the Fortune 500. That reach could make it easier for Dolly to integrate with enterprise clients' existing data infrastructure.

“This is a good first step in showing enterprises how they create models on their own, own the models outright, and customize or tweak the models as necessary without the need to pay API access fees or share data with LLM providers, which can be a huge issue for certain enterprises in the regulated industries,” Thurai said.
