A paper from the European Commission last month may have marked one of the first times that the phrase “hella swag” has appeared in the EU executive body’s policy research.
HellaSwag—the name is actually a mouthful of an acronym—is one of many go-to benchmarks that AI developers use to gauge the performance of their models. You’ve probably seen them listed like report cards in OpenAI or Google announcements.
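For a rough sense of what “running a benchmark” means in practice, here is a minimal sketch of how a multiple-choice test like HellaSwag is commonly scored: the model assigns a likelihood to each candidate ending, its top pick is compared to the gold answer, and accuracy is tallied. The tiny example item, the use of GPT-2 as a stand-in model, and the raw (non-length-normalized) scoring are illustrative simplifications, not the exact recipe any lab or eval harness uses.

```python
# Sketch: scoring a HellaSwag-style multiple-choice item by log-likelihood.
# The `examples` list is made up for illustration; real evaluations use the
# actual dataset and usually larger models plus length normalization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

examples = [
    {
        "ctx": "She cracked the eggs into the bowl and",
        "endings": [
            "whisked them with a fork.",
            "planted them in the garden.",
            "mailed them to a friend.",
            "read them a bedtime story.",
        ],
        "label": 0,  # index of the correct ending
    },
]

def ending_logprob(ctx: str, ending: str) -> float:
    """Sum of token log-probabilities of `ending` conditioned on `ctx`."""
    ctx_ids = tokenizer(ctx, return_tensors="pt").input_ids
    full_ids = tokenizer(ctx + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probs over the next token at each position, aligned to the
    # tokens they predict
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the positions that belong to the ending
    n_ctx = ctx_ids.shape[1]
    return token_lp[0, n_ctx - 1:].sum().item()

correct = 0
for ex in examples:
    scores = [ending_logprob(ex["ctx"], e) for e in ex["endings"]]
    pred = max(range(len(scores)), key=scores.__getitem__)
    correct += int(pred == ex["label"])

print(f"accuracy: {correct / len(examples):.2%}")
```

The headline numbers in model announcements are, at bottom, aggregates of loops like this one, which is part of why a benchmark’s design choices matter so much.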
These benchmarks influence how companies train and develop models. And with regulations like the European Union’s AI Act now law, policymakers are also turning to some of these measurement schemes to classify AI systems.
The only problem? Experts say many of the widely used benchmarks leave a lot to be desired. They might be easily gamed or outdated, or do a bad job of taking stock of a model’s actual skills. They may evaluate capabilities that are largely irrelevant to how people actually use AI. And lately, as the pace of AI development has accelerated, researchers are having an increasingly hard time devising tests that AI can’t quickly master.
“If you look at [common benchmarks] from a technical perspective, they’re actually not that good,” Anka Reuel, a graduate fellow at the Stanford Institute for Human-Centered AI, told Tech Brew. “It’s kind of like the Wild West when it comes to benchmarks and actually, [evaluation] design more broadly, which is a huge issue right now. Because as a community, we never really put a focus on how to design them.”
Grading the graders: Reuel was attempting to implement various AI benchmarks in the course of her research when the frustration of that process drove her to pivot into a whole separate project. She spearheaded a paper late last year called BetterBench that evaluates different popular benchmarks across usability, design, and overall quality.
Stanford claims it’s the first measure to grade AI benchmarks in this way. The paper finds that popular benchmarks “vary significantly in their quality,” and some of the most common ones fare poorly on usability and design. Reuel said she and her co-authors provided 46 best practices for designing better benchmarks, grounded in research on the history of benchmarking in AI and other fields.
But moving on from entrenched and widely accepted benchmarks can be a difficult process for developers, Reuel said.
“It’s kind of like inertia to try something new. A lot of new benchmarks are hard to implement,” Reuel said. “So if you already have a working implementation, you don’t even bother because it can literally take months and tons of compute to get benchmarks to work.”
A+ students: Perhaps more unnervingly, benchmark creators are now having trouble devising tests that actually challenge new AI models for long. That was the impetus for a recently released benchmark called Humanity’s Last Exam (HLE), which collected nearly 3,000 multiple-choice and short-answer questions from top experts in fields ranging from physics and math to the humanities and social sciences. Researchers at the Center for AI Safety (CAIS) and Scale AI led the project.
“With models advancing—more computing power and better data—they are now routinely ‘acing’ earlier benchmarks like MMLU, making it more challenging to have a measure of AI progress,” Alice Gatti, a CAIS researcher and author on the paper, told Tech Brew in an email. “If the models ‘ace’ HLE, they effectively would be capable of answering questions on any topic more accurately than human experts and we will need to look for other ways—more qualitative ways—to measure the capability of AIs.”
OpenAI claimed its Deep Research tool scored 26.6% on Humanity’s Last Exam, but Gatti said this isn’t totally unexpected, considering the model had access to internet search. “We did not allow the models to use search in HLE and we ‘Google-proofed’ our questions,” she said.
Still, Gatti said the team “wouldn’t be surprised to see the scores rise quickly and potentially surpass 50% by the end of the year.”
“Right now, Humanity’s Last Exam shows that there are still some expert closed-ended questions that models are not able to answer,” Gatti said. “We will see how long that lasts.”
‘Yes-men on servers’: But is having an AI pass progressively harder tests the only or best way to level up its intelligence? Thomas Wolf, the co-founder and chief science officer of Hugging Face, challenged this notion in a widely read essay earlier this month.
Wolf proposed that in order for AI models to yield scientific discoveries, they need to move beyond being “A+ students” or “yes-men on servers” and learn to make bold inquiries or use their vast trained knowledge to challenge conventional wisdom. That type of AI thinking would likely require new benchmarks to strive for, he said.
“Once you have a nice measure of something, you have this North Star,” Wolf told Tech Brew. “Right now, we’re not really measuring if this model asks good questions, we’re measuring…good answers.”