AI text detectors are showing up everywhere, scanning essays, emails, and social media posts to catch machine-written content. But how well can they actually spot a bot? While these tools often get it right, they’re far from foolproof, sparking a growing debate about their real reliability.
Could you tell that the above paragraph was written by ChatGPT? If not, you’re not alone; only three of the 10 online detector tools we fed that passage to (along with some additional AI output, to give them a fairer chance) flagged it as highly likely to be AI-generated. Another two judged it a mix, and five found no evidence of AI at all.
(Editor’s note: Tech Brew would never publish an AI-written lede were it not to prove a point. The reporter noted it took much honing and back-and-forth to generate something serviceable.)
Since ChatGPT first thrust text generators into the mainstream almost two years ago, a cottage industry of tools has promised to suss out AI-generated text. Educators, platform moderators, editors, and hiring managers have turned to these models in hopes of restoring a semblance of order amid an onslaught of AI-generated student essays, social media posts, book submissions, and other mass-produced copy.
But the capabilities of these tools vary widely. A paper earlier this year from researchers at the University of Pennsylvania found that many text detectors exaggerate their prowess: contrary to the up-to-99% accuracy some toolmakers claim, performance often fluctuates depending on the type of text and the model that produced it.
Error rates might be acceptable in certain contexts, but false positives can be ruinous in education, where faulty detection can expose students to baseless accusations of cheating. A Stanford University study last year found that detectors tend to be especially biased against non-native English speakers, falsely flagging more than half of the essays they wrote as AI-generated.
“Ideally, you want to be operating at a point way at the end, where it’s very unlikely that you would accuse someone falsely of cheating,” said Chris Callison-Burch, a UPenn professor of computer and information science and lead author on the above-mentioned UPenn paper. “But most systems only perform accurately where you’ve got a much higher false-positive rate.”
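Callison-Burch is describing the classic threshold tradeoff: a detector outputs a score, and where the cutoff is set determines how often humans get wrongly flagged. A back-of-the-envelope sketch with made-up scores (using scikit-learn’s `roc_curve`; the numbers are invented for illustration, not drawn from any real detector) shows how demanding a near-zero false-positive rate shrinks the share of AI text actually caught:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up detector scores: higher = "more likely AI-written."
rng = np.random.default_rng(0)
human_scores = rng.normal(0.35, 0.15, 1000)  # human essays cluster low
ai_scores = rng.normal(0.65, 0.15, 1000)     # AI essays cluster high
y_true = np.concatenate([np.zeros(1000), np.ones(1000)])  # 1 = AI-written
y_score = np.concatenate([human_scores, ai_scores])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The operating point Callison-Burch describes: the strictest threshold
# that keeps false accusations below 1 in 1,000.
i = np.searchsorted(fpr, 0.001, side="right") - 1
print(f"threshold={thresholds[i]:.2f}  "
      f"humans falsely flagged={fpr[i]:.2%}  AI text caught={tpr[i]:.1%}")
```

With these toy numbers, keeping false accusations that rare means catching only a small fraction of the AI-written samples; relaxing the threshold catches more AI but accuses more humans.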
Grading the graders
Callison-Burch and his team have created the first benchmark meant to gauge the accuracy of these detectors under a single, standardized evaluation.
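The basic shape of such a benchmark is simple in outline: a labeled corpus is sliced by source model and genre, and every detector is scored on the same slices. The minimal harness below is purely hypothetical (the sample data, slice names, and `detect` callables are invented, not the paper’s actual design):

```python
from collections import defaultdict

# Labeled samples: (text, source_model, genre, is_ai). Invented examples;
# a real benchmark holds thousands of samples per slice.
samples = [
    ("The mitochondria is the powerhouse of the cell.", "gpt-4", "essay", True),
    ("dear hiring manager, i am thrilled to apply...", "human", "email", False),
]

def score_detector(detect, samples, threshold=0.5):
    """Per-(model, genre) accuracy for one detector.

    `detect` is any callable mapping text -> estimated P(AI-written).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for text, model, genre, is_ai in samples:
        key = (model, genre)
        totals[key] += 1
        hits[key] += int((detect(text) >= threshold) == is_ai)
    return {key: hits[key] / totals[key] for key in totals}

# Running every detector over the exact same slices is what makes the
# numbers comparable across tools:
# for name, detect in detectors.items():
#     print(name, score_detector(detect, samples))
```

Slicing this way is what reveals the fluctuation the UPenn team measured: a detector can ace one genre from one model and stumble on another.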
Edward Tian, CEO and founder of text detector GPTZero, said there is a need for more standardized measurement across different detectors, though he said he wasn’t familiar with Callison-Burch’s system. His company is currently working with a lab at Pennsylvania State University to validate GPTZero’s claim that it’s 99% accurate, with a 1% false-positive rate, on purely AI-generated text.
“We’re working on developing independent benchmarking for GPTZero because a lot of the times, a lot of these AI detectors say they’re 99% accurate, and it’s very easy to [be] if you’re testing on a data set you’re overfit to, for example,” Tian said.
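Tian’s caveat is easy to reproduce in miniature. In the hypothetical sketch below, a toy detector is tuned on text drawn from one distribution, then scored both on a fresh sample of that same distribution and on a shifted one standing in for a different genre or generator; the first number flatters the detector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def sample(n, ai_shift):
    """Toy feature vectors standing in for detector signals."""
    human = rng.normal(0.0, 1.0, (n, 5))
    ai = rng.normal(ai_shift, 1.0, (n, 5))
    X = np.vstack([human, ai])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

# Data the detector was tuned on: AI text is easy to spot here.
X_tune, y_tune = sample(2000, ai_shift=1.5)
detector = LogisticRegression().fit(X_tune, y_tune)

# A fresh sample from the same distribution vs. a subtler "genre."
X_same, y_same = sample(2000, ai_shift=1.5)
X_shift, y_shift = sample(2000, ai_shift=0.4)

print(f"accuracy on familiar text:   {detector.score(X_same, y_same):.1%}")
print(f"accuracy on unfamiliar text: {detector.score(X_shift, y_shift):.1%}")
```

The gap between those two printed numbers is the overstatement Tian is warning about: a headline accuracy figure only means much on data the detector wasn’t tuned for.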
Beyond education, GPTZero is working with tech companies trying to avoid a feedback loop in which LLMs are trained on AI-generated content, a failure mode known as model collapse. The company is also courting clients like hiring managers and trust-and-safety platforms.
More gray areas
But while GPTZero claims high success rates on purely AI-spawned text, Tian said many of its clients aren’t just interested in weeding out AI; they want to measure how much AI was used in a given block of text, and to what end. To reflect that shift, GPTZero has added features like highlighting common AI phrases and tracing authorship provenance.
“We don’t believe it’s a binary of this is entirely AI or this is entirely human,” Tian said.
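As a toy illustration of the phrase-highlighting idea, the sketch below scans text for stock AI-sounding phrases; the phrase list is invented for the example, and GPTZero’s actual method is certainly more sophisticated:

```python
import re

# Invented examples of stock AI phrasing; not GPTZero's actual list.
AI_TELLS = [
    "delve into",
    "it's important to note",
    "in today's fast-paced world",
    "navigate the complexities",
]

def highlight_tells(text):
    """Wrap suspected stock AI phrases in >>...<< markers."""
    for phrase in AI_TELLS:
        text = re.sub(re.escape(phrase), lambda m: f">>{m.group(0)}<<",
                      text, flags=re.IGNORECASE)
    return text

print(highlight_tells("It's important to note that we must delve into this."))
```

Surface phrase-matching like this can only ever hint at AI involvement, which is exactly why Tian frames the output as a spectrum rather than a verdict.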
Jenny Maxwell, head of education at Grammarly, echoed the view that either-or detection is the wrong approach. Rather, Grammarly has rolled out a tool called Authorship that attempts to offer more information about how exactly AI was used in a given piece of writing.
“We actually believe AI detection inherently is deeply flawed, and it’s flawed because it’s imperfect, and if you’re using it at scale to determine any sort of morality ascribed to using AI or not, you run the risk of harming students,” Maxwell said.
While Grammarly offers its own detector, Maxwell said the tool is geared toward helping inform students “if there were pieces of their content that needed more attention, deeper citation, or rewording.”
Kartik Hosanagar, a professor of technology and digital business at the UPenn Wharton School, said he’s found text detectors to be accurate enough for certain contexts where absolute certainty isn’t necessary. As for his own assigned classwork, he said he has adjusted it so that AI is not as useful to students.
For instance, instead of broad essay prompts that can be easily answered with AI, he might ask students to do “a custom consulting assignment about a specific company where they have to apply ideas taught in the course to solve a specific issue that a firm is dealing with.”
“We have to, in my opinion, get used to a world where we have highly accurate but not perfect AI text detectors,” Hosanagar said. “And there are settings where that’s good enough, and there are settings where you have to bake into your process that you will not have perfect detection.”
Correction 11/07/24: This piece has been updated to correct the name of GPTZero founder and CEO Edward Tian.