ChatGPT might suffice for clacking out an email to your boss, but can the same style of AI write a DNA sequence for a life-saving drug?
The scientists behind a new crop of foundation models are hoping there’s a chance. One of those projects comes from Ginkgo Bioworks, which recently announced a five-year deal with Google Cloud to host AI geared toward synthetic biology.
Ginkgo, which works on the genetics behind everything from pharmaceuticals and fertilizers to fragrances and cannabinoids, has two advantages in its own effort: a big store of proprietary data from its 15-year history and the automated infrastructure to collect it, according to Ginkgo’s CEO and co-founder, Jason Kelly.
We spoke with Kelly about generative AI’s potential for biotech, what went into the experiment, and what it might yield.
This conversation has been edited for length and clarity.
How is Ginkgo attempting to bring generative AI to biotech?
The question now is, “Could we make foundation models in biology? What would that look like?” Well, for starters, DNA is code, right? So take a gene, which is one element of DNA; a human has tens of thousands of genes. A gene is read from start to finish, like a book. So you can do the exact same kind of [predictive training as large language models], except in this case, it’s a gene from nature. And our “internet” is nature. Because the internet was not just made-up random words; that’s why these models can actually learn English. It was English sentences written by humans who know grammar, and then the model backed out from all of it what we know in our heads, [like] how grammar works. Same thing with the genomes out in nature: they are not random sequences of [DNA bases] As, Ts, Cs, and Gs. They are a product of 4 billion years of evolution; there is a grammar to them. They are a language; there is a design to a protein that makes it a good protein. And if you jumble it up, it turns into a piece of crap…Like, genetic diseases are exactly that: a mutation that breaks [the protein].
So what’s out there in nature is all these books, and they’re written in this language of DNA and proteins. So here’s the magic: Do that same process. Train a gigantic neural net, a big foundation model, and teach it to write DNA, just like ChatGPT or GPT-4 learned to write English and learned the basic rules of grammar…The magic would be if that foundation model, plus all the antibody data…[were] more effective at drug discovery, at designing new crop protection, at (who knows?) whatever you want to get biotech to do. Maybe all of that would work better with this foundation model behind the machine learning (ML) we’ve been doing historically. And so that’s what we’re building with Google. Ginkgo is basically training a giant neural net, using as many genome books as we could get our hands on, and we think we have more of those than anybody else. And so we’re gonna train this big model.
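To make the analogy concrete, here’s a minimal sketch of the training objective Kelly is describing: treat DNA as text over a four-letter alphabet and train a model to predict the next base, the same next-token objective behind ChatGPT. This is a hypothetical toy, not Ginkgo’s system; the model, its sizes, and the gene fragment are all invented for illustration.

```python
# Toy next-base prediction on DNA, mirroring how LLMs are trained on text.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "T": 1, "C": 2, "G": 3}  # the four DNA bases

def encode(seq: str) -> torch.Tensor:
    """Turn a DNA string into a tensor of base ids."""
    return torch.tensor([VOCAB[b] for b in seq])

class TinyDNALM(nn.Module):
    """A toy causal language model over DNA bases."""
    def __init__(self, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position can only attend to earlier bases.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(self.embed(x), mask=mask)
        return self.head(h)  # (batch, length, 4) logits over A/T/C/G

# One training step: predict base i+1 from bases 0..i.
model = TinyDNALM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seq = encode("ATGCGTACCGGTTAA").unsqueeze(0)  # an invented gene fragment
logits = model(seq[:, :-1])                   # inputs: every base but the last
targets = seq[:, 1:]                          # targets: every base but the first
loss = nn.functional.cross_entropy(logits.reshape(-1, len(VOCAB)), targets.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```

The real version presumably swaps this toy transformer for a far larger model and the 15-base fragment for the genome “books” Kelly mentions, but the objective, predicting what comes next, is the same.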
And there have been early successes. Google itself did a project called AlphaFold through DeepMind that was exactly this: training on the books and some of the shapes of proteins in the public datasets. It’s just that those datasets are not very big. And so the question is: We have a bigger dataset, and we can do more compute. Can we make a breakout foundation model that can be used for forward design?
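For a sense of what “forward design” could mean mechanically, here’s a hedged sketch of generation from such a model: sample candidate sequences one base at a time. This is an illustration, not Ginkgo’s method; `sample_dna` and `model` are hypothetical names, where `model` is any causal network mapping (batch, length) base ids to (batch, length, 4) logits, like the toy above.

```python
# Hypothetical "forward design" step: sample new DNA from a trained causal model.
import torch

BASES = "ATCG"  # must match the base-id order used during training

@torch.no_grad()
def sample_dna(model, prompt_ids: torch.Tensor, n_new: int, temperature: float = 1.0) -> str:
    """Extend a (1, length) tensor of base ids by n_new sampled bases."""
    seq = prompt_ids.clone()
    for _ in range(n_new):
        logits = model(seq)[:, -1, :] / temperature    # logits for the next base
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_id], dim=1)         # append and continue
    return "".join(BASES[i] for i in seq[0].tolist())
```

Whether sequences sampled this way can beat scientists plus their current tools is exactly the open question, as the rest of the conversation makes clear.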
What kinds of resources is the project going to take?
To give you a sense, the facility we built here represents probably about half a billion dollars in robotics, infrastructure, and the software to run it all. On the Google deal, we’re spending $250 million on compute.
They’re also invested, and then we get milestones back. So we get like $50 million and change; Google’s basically happy to see us rolling out models in these areas because I think they are also very interested on the cloud side…In the cloud business, they want to sell more compute; they want to sell more people training models and doing inference on models; and they don’t really care if the model is a language model or a DNA model or a chemistry model or whatever. And so we’re kind of a bet by them to say, “Hey, we’ll give you milestones if you prove out that DNA models could be another use of this technology, of neural nets beyond just what people are doing in language models.” And so that’s kind of the two-way relationship we have with Google over the next few years.
You mentioned that you’re treating this as an experiment. And the unknown at this point is whether it can actually generate new DNA designs successfully?
Correct. And we already have ML tools. It’s not like our scientists are just writing on a whiteboard—As, Ts, Cs, and Gs—or using physical modeling tools. Can it beat the state of the art of scientists plus their current computational tools? Can you just ask this frickin’ thing, and it beats them? TBD.
And there are people working on…generative DNA design. And there have been little wins here and there, but it’s not like everyone’s saying, “Oh, we’re just done.” That has not happened yet. But I think in the next year you will see a ton of work in this area get revealed. And we’ll see how it does.