
Why AI models might seem to perform worse over time

Issues like data drift and increased hallucinations can arise as AI systems rack up more mileage in the real world.

[Illustration: an AI model brain hallucinating, by Amelia Kinsinger]

Every time you swipe a credit card, most major financial companies will tap machine learning to generate a risk score—a likelihood that the purchase is fraudulent.

But over time, scammers may switch up their tricks or buying habits might change. Suddenly, the machine learning model is operating in a different environment than that on which it was trained. Maybe it’s flagging legitimate transactions or missing actual fraud as a result.

This is a phenomenon called “model drift”—a mismatch that emerges over time between what an AI system was trained to do and how it operates in the real world. And it’s not just a problem for credit card fraud detection; it can affect models of all kinds, including LLMs, according to Helen Gu, founder and CEO of InsightFinder, which works with Visa to avoid these kinds of outcomes.

“Model drift is hard to detect because it’s not something you can actually clearly describe using one metric, and it’s basically a running metric…you have to compute this metric over a period of time,” Gu said.
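Gu didn’t specify which metric InsightFinder computes, but one common way to operationalize that kind of “running” drift metric is the population stability index (PSI), compared between a training baseline and rolling windows of production data. Here’s a minimal sketch assuming a single numeric feature and the conventional 0.2 alert threshold—both illustrative choices, not anything attributed to the companies mentioned:

```python
# Illustrative sketch of a "running" drift metric: the population stability
# index (PSI) between a training baseline and rolling windows of live data.
# The metric choice and the 0.2 alert threshold are common rules of thumb,
# not InsightFinder's or Visa's actual method.
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two samples of one feature."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    base_pct = np.histogram(baseline, edges)[0] / len(baseline)
    live_pct = np.histogram(live, edges)[0] / len(live)
    base_pct = np.clip(base_pct, 1e-6, None)  # avoid log(0) and division by zero
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - base_pct) * np.log(live_pct / base_pct)))

def rolling_drift(baseline: np.ndarray, stream: np.ndarray, window: int = 1_000):
    """Yield (index, PSI) for each full window of incoming production values."""
    for end in range(window, len(stream) + 1, window):
        yield end, psi(baseline, stream[end - window:end])

# Example: transaction amounts slowly shift upward after deployment.
rng = np.random.default_rng(0)
train_amounts = rng.lognormal(3.0, 1.0, 50_000)
live_amounts = rng.lognormal(3.4, 1.1, 10_000)  # drifted distribution
for end, score in rolling_drift(train_amounts, live_amounts):
    flag = "DRIFT" if score > 0.2 else "ok"
    print(f"window ending at {end}: PSI={score:.3f} {flag}")
```

The point of a metric like this is exactly what Gu describes: no single snapshot tells you much, but a score tracked window after window shows when production data has wandered away from what the model saw in training.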

It’s one of a few reasons that AI models might seem to degrade over time without constant monitoring and tweaking. As more businesses move from AI prototypes to full-fledged production, fending off these sorts of real-world quality issues may be more top of mind.

“Back in 2023, the focus was more around communicating that there is something called hallucination and why it could happen,” said Amit Paka, founder and COO of the AI observability platform Fiddler. “Now [the conversation with clients is] more around, ‘What’s the quality of the hallucination metric that you have?’”

New models, new risks

Fiddler, which works with clients like the US Navy and Integral Ad Science, has been helping companies track performance issues in machine learning algorithms since 2018. But the rise of LLMs in the last couple of years has made that mission trickier. Unlike previous generations of the technology, generative AI can hallucinate or leak personally identifiable information, and its inner workings are even harder to decipher than those of black-box machine learning algorithms.

“In some ways, the risk aperture has widened,” Paka said. “We had to build a complementary product that had to now handle a wider range of risks than would have been the case following machine learning.”

Paka said when companies create LLM-based apps, they tend to train them on a certain expectation of what a user might ask the AI. But the actual inputs in practice can stray from those predictions. And LLMs may tend to hallucinate more as they are pushed beyond a narrowly trained function.
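Paka didn’t detail how Fiddler catches this, but the underlying idea—comparing what users actually send against what the app was built to handle—can be illustrated with a few lines of code. The sketch below flags prompts that don’t resemble any expected use case; the example prompts, the plain bag-of-words similarity measure, and the 0.3 cutoff are all made-up assumptions for illustration (a production system would use learned embeddings):

```python
# Rough illustration: flag production prompts that look unlike anything the
# LLM app was designed and tested for. Plain bag-of-words cosine similarity
# keeps the sketch self-contained; real systems would use embeddings.
import math
import re
from collections import Counter

EXPECTED_PROMPTS = [  # hypothetical examples for a banking-support assistant
    "How do I dispute a charge on my credit card?",
    "What is my current account balance?",
    "Help me set up a recurring payment.",
]

def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_out_of_scope(prompt: str, threshold: float = 0.3) -> bool:
    """True if the prompt doesn't resemble any expected use case."""
    vec = bag_of_words(prompt)
    return max(cosine(vec, bag_of_words(p)) for p in EXPECTED_PROMPTS) < threshold

print(is_out_of_scope("Can you help me dispute a credit card charge?"))  # False: in scope
print(is_out_of_scope("Write me a poem about quantum gravity"))          # True: off-script
```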


One way Fiddler protects against these hallucinations is through a concept called LLM-as-a-judge, according to Paka. The company has trained a small language model solely for the purpose of evaluating how closely a search retrieval LLM’s answer matches the source material.
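The specifics of Fiddler’s judge model aren’t public, but the general LLM-as-a-judge pattern is straightforward: hand an evaluator model the retrieved source text and the generated answer, and ask it for a faithfulness verdict. Below is a minimal sketch of that pattern, with the judge call abstracted behind a function you’d wire to whatever small model you actually use; the prompt wording and the 1–5 scale are assumptions for illustration, not Fiddler’s internals:

```python
# Minimal sketch of the LLM-as-a-judge pattern: a second (usually smaller)
# model grades how faithful a retrieval-augmented answer is to its source.
# `call_judge` is a placeholder for whatever model client you actually use;
# the prompt wording and 1-5 scale are illustrative assumptions.
import re
from typing import Callable

JUDGE_PROMPT = """You are grading an answer for faithfulness to the source.
Source:
{source}

Answer:
{answer}

On a scale of 1 (contradicts or invents facts) to 5 (fully supported by the
source), reply with a single integer and nothing else."""

def judge_faithfulness(source: str, answer: str,
                       call_judge: Callable[[str], str]) -> int:
    """Return a 1-5 faithfulness score; raise if the judge's reply is unparseable."""
    reply = call_judge(JUDGE_PROMPT.format(source=source, answer=answer))
    match = re.search(r"[1-5]", reply)
    if not match:
        raise ValueError(f"Judge returned an unexpected reply: {reply!r}")
    return int(match.group())

# Usage with a stubbed judge; swap the lambda for a real model call.
fake_judge = lambda prompt: "2"  # pretend the judge found weak support
score = judge_faithfulness(
    source="The refund policy allows returns within 30 days of purchase.",
    answer="You can return items within 90 days.",
    call_judge=fake_judge,
)
print("faithfulness:", score)  # e.g., flag the answer for review if score < 4
```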

Paka said there’s also been more investment in AI observability in general, including for traditional machine learning algorithms, as AI has garnered bigger budgets since the LLM revolution.

Dumbing down?

At a more foundational level, AI users have occasionally pointed to anecdotal bouts of “laziness” or deterioration in models like OpenAI’s ChatGPT or Anthropic’s Claude.

At least one study has documented these fluctuations. A 2023 pre-print paper from researchers at UC Berkeley and Stanford found significant drift in how well GPT-3.5 and GPT-4 performed across tasks like code generation, opinion surveys, and US Medical Licensing Exam questions.

These types of reports remain largely anecdotal and speculative, but the systems involved are vastly complex, and tracing why such issues emerge is currently all but impossible amid billions of parameters. That’s despite the emerging field of mechanistic interpretability, which tries to make sense of exactly which nodes in an LLM correspond to which concepts.

Researchers have also identified a serious potential risk as more and more of the internet’s available public training data becomes infused with AI-generated output. Models that train on AI-generated data are subject to a problem called model collapse, wherein “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models.”

Florian Douetteau, CEO of AI data platform Dataiku, said model collapse is likely avoidable for now as long as tech companies are continuing to sift for high-quality data. He said cognition issues in foundation models are also more likely to be quirks of individual iterations, but companies should still have tools in place to monitor quality performance over time.

“Model degradation itself as, let’s say, a systemic case for mankind, is probably not a real risk compared to many others,” Douetteau said. “But as an individual or enterprise consuming models, you need to essentially put tests in place in order to manage the model supplier you’re using.”
