A primer on synthetic data, which is gaining steam for AI

By 2024, 60% of all AI training data may be synthetic, according to Gartner.

Animation: Dianna “Mick” McDougall, Photo: Getty Images

April 15, 2022


Like wax museum celebrities or cake versions of household objects, synthetic data can be difficult to distinguish from the real thing.

You can think of it as data that reads, looks, or acts like it’s been collected from real people, when in actuality, it’s been created by artificial intelligence. Here’s how it works: Deep-learning algorithms train on real-world data, then do what models do best: flag patterns and trends. Then, they use those patterns to create an entirely new set of data without ties to any real individuals.
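To make the train-then-generate loop concrete, here is a minimal sketch. It swaps the deep-learning model for a much simpler stand-in, a two-column Gaussian fit, and the customer records, column meanings, and function names are all hypothetical. It shows the shape of the idea only and makes no privacy guarantee on its own.

```python
import math
import random
import statistics

# Hypothetical "real" records: (age, monthly_spend). Invented for illustration.
real = [(23, 120.0), (31, 180.5), (45, 260.0), (52, 310.0),
        (37, 210.0), (29, 150.0), (61, 355.0), (48, 280.0)]

def fit_model(rows):
    """Learn the patterns: each column's mean and spread, plus their correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    corr = max(-1.0, min(1.0, cov / (std_x * std_y)))  # clamp against rounding error
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "corr": corr}

def sample_synthetic(model, n, rng):
    """Draw brand-new rows that follow the learned patterns, not any real row."""
    (mean_x, mean_y), (std_x, std_y) = model["mean"], model["std"]
    rho = model["corr"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mix z1 into the second column so the correlation is preserved.
        y = mean_y + std_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        rows.append((x, y))
    return rows

model = fit_model(real)
synthetic = sample_synthetic(model, 1000, random.Random(0))
synthetic_model = fit_model(synthetic)  # re-fit to compare patterns
```

Comparing `model` with `synthetic_model` is a quick way to check that the synthetic rows keep the overall relationship between age and spend, even though no synthetic row is copied from a real one.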

The goal? Preserve both the data insights and personal privacy.

Early industry adopters include highly regulated fields like health care, insurance, banking, and telecommunications. And investment in synthetic-data startups is flowing: In October, Facebook acquired AI.Reverie for an undisclosed amount. In January, Mostly AI raised $25 million. And last month, Synthetaic raised $13 million, followed by Datagen announcing a $50 million funding round.

In 2021, the synthetic data market was worth $110 million, according to a snapshot of 76 companies by Cognilytica, an AI analytics firm. By 2027, the firm projects, synthetic data will be worth $1.15 billion. And by 2024, Gartner predicts, 60% of all AI training data may be synthetic.
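For a sense of what that projection implies, a back-of-the-envelope calculation helps. The sketch below assumes smooth year-over-year compounding between the two Cognilytica figures, which the firm's snapshot does not itself state; the variable names are ours.

```python
# Endpoints from the Cognilytica figures cited above.
start_value = 110e6   # 2021 market size, in dollars
end_value = 1.15e9    # 2027 projection, in dollars
years = 2027 - 2021

# Compound annual growth rate, assuming steady compounding (our assumption).
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied annual growth: {cagr:.1%}")
```

Under that assumption, the market would need to grow by roughly 48% a year to hit the 2027 figure.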

Synthetic data can take the form of tabular data (think: purchase records, financial transactions), images (think: AI-generated faces and portraits), or video (think: 3D simulations used to train autonomous vehicles). In this piece, we’ll mostly focus on tabular data.

Why synthetic is picking up steam

Personal data has been a hot commodity for as long as commerce has existed—for retailers, it helps inform store inventory; for financial institutions, it plays a part in credit-card and loan offers; and for apps and other businesses, it’s something they can monetize directly.

Modern AI systems need more and more high-quality data to train on, but large amounts of that kind of data are difficult to locate and prone to bias. At the same time, stricter privacy laws like the California Consumer Privacy Act and the EU’s General Data Protection Regulation have gone into effect. Some see synthetic data as the solution.

On the privacy front, research suggests that anonymized data can be surprisingly easy to re-identify, even when much of the information on an individual has been deleted or tweaked.

“Techniques [that] were developed in the era of small data just don’t work for this rich behavioral data that we have nowadays because everybody has this unique digital fingerprint that makes it so easy to reconnect a supposedly traditionally anonymized data set with some other information floating around in the web,” Alexandra Ebert, chief trust officer at Mostly AI, a Vienna-based synthetic data company, told us.
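The kind of reconnection Ebert describes is often called a linkage attack. The sketch below is a toy version under loud assumptions: every record, name, and field is invented, and real attacks typically work with far richer behavioral data than three columns.

```python
# Hypothetical "anonymized" records: names removed, other attributes kept.
anonymized = [
    {"zip": "94110", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10001", "birth_year": 1985, "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical auxiliary data "floating around" elsewhere, with names attached.
public = [
    {"name": "Alice Example", "zip": "94110", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "zip": "94110", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "zip": "60601", "birth_year": 1972, "gender": "F"},
]

# Fields that are harmless alone but identifying in combination.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")

def link(anonymized_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Match people to anonymized rows that share a unique key combination."""
    index = {}
    for row in anonymized_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = {}
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate means re-identified
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches

reidentified = link(anonymized, public)
```

In this toy data, two of the three "anonymous" diagnoses end up attached to a name, without any name ever appearing in the anonymized set.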

What’s more, data collected from the real world typically reflects racism, sexism, and other forms of bias found in society—and the algorithms trained on that data can learn and propagate those patterns on a large scale. Synthetic data could help make it possible to conduct more in-depth audits of AI systems than is possible with traditional data.

“One of the biggest challenges that fairness practitioners with big organizations face nowadays is that they actually don’t even know whether the systems are exhibiting bias or not,” Ebert said. “Reason being is that to know whether you’re discriminating against a certain user group, you need to know in which group a given user falls, and in many regulatory environments, it’s prohibited to use data about ethnicity, gender, and so on and so forth. So they’re kind of operating in the blind.”

Synthetic data could allow external auditors to look at how an algorithm is treating vulnerable populations without that data being tied to any real individuals.
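As a rough illustration of what such an audit might compute, the sketch below compares approval rates across groups in a set of synthetic records. The records, group labels, and the choice of metric (a simple ratio of approval rates, one of many possible fairness measures) are our assumptions, not anything described by Mostly AI.

```python
from collections import defaultdict

# Hypothetical synthetic records: (group label, did the model approve?).
synthetic_decisions = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

def approval_rates(records):
    """Share of approved decisions within each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in records:
        totals[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / totals[group] for group in totals}

rates = approval_rates(synthetic_decisions)

# Ratio of the lowest to the highest approval rate; 1.0 would mean parity.
disparity = min(rates.values()) / max(rates.values())
```

Because the group labels live only in synthetic records, an auditor could run a check like this without handling any real person's ethnicity or gender.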

But the tech still has its own potential problems: Although creating synthetic data from real data can help assuage privacy concerns, what’s to stop it from inheriting the same biases the original data had? That’s something the emerging field of “fair synthetic data” aims to take on, Ebert said, though it’s “not a silver bullet.”

In the near future, Vivek Muppalla, director of synthetic services at Scale AI, which recently launched a synthetic-data product, foresees vertical consolidation in the synthetic-data space. As it stands now, some companies specialize in data labeling, while others specialize in data management or synthetic-data creation.

“The big thing I would see, in the next three to five years, is…products that tackle the full spectrum of the challenge,” Muppalla said, adding that the spectrum would include data-management platforms with insights about data, quality of data, and quality of labels, as well as recommendations for how to procure more real-world or synthetic data, and even the ability to generate synthetic data on a user’s behalf.

“That, for me,” Muppalla added, “is the biggest trend—how we blend, effectively, real-world and synthetic data, versus thinking about it in siloes.”—HF
