Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a new machine learning method that limits a model’s ability to buy its own line of BS, giving it the oh-so-rare talent of admitting when it just doesn’t know something.

Back when AI was first introduced to the world (before it could do anything even remotely useful), it was used to create hallucinatory imagery. The way it worked was simple: starting from a prompt that usually had minimal detail, the model would attempt to identify these details abstractly, then edit the image to better match those abstractions.

This meant that, as the model incorrectly identified a spot as an eye, it would iteratively edit the spot to look more and more like an eye. The more eye-like the spot became, the more likely it was to be identified as an eye again and iterated even further in that direction. So, as the model is allowed to follow its own incorrect perceptions, it becomes surer of its incorrect conclusions. For DeepDream and other models designed to create trippy, hallucinatory imagery, that’s the end of the development process.
You can feed DeepDream anything as a starting point, even works of fine art.
Credit: P.J. Finlay/DeepDream
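For intuition, here is a toy Python sketch of the kind of self-reinforcing loop described above. The "detector" and "edit" steps are deliberately simplistic stand-ins, not DeepDream's actual algorithm (which climbs the gradient of a neural network's feature activations):

```python
# Toy illustration of a DeepDream-style feedback loop (not the real algorithm).
# A "detector" guesses how eye-like a spot is; each step then edits the spot to
# better match that guess, which makes the next guess even more confident.

def eye_likeness(spot_intensity: float) -> float:
    # Stand-in detector: treats the spot's intensity itself as its eye-likeness.
    return spot_intensity

def amplify(spot_intensity: float, steps: int = 8, step_size: float = 0.3) -> float:
    for step in range(steps):
        score = eye_likeness(spot_intensity)
        # Edit the image in whatever direction makes the detector more confident.
        spot_intensity = min(1.0, spot_intensity + step_size * score)
        print(f"step {step}: eye-likeness = {spot_intensity:.2f}")
    return spot_intensity

# A faint, ambiguous spot (0.2) gets amplified into a confident "eye" (toward 1.0).
amplify(0.2)
```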
The problem is that, while large language models (LLMs) like ChatGPT are supposed to be factually reliable, they actually exhibit a weaker form of the same hallucinatory style of thinking. An LLM doesn’t work in quite the same iterative sweeps as DeepDream, but it does start from simple inferences, then extrapolates out to larger conclusions based on the assumption that those earlier inferences were correct.

LLMs are generally trained to be rewarded for true answers and penalized for false ones, and this is true whether the correct answer was produced by reasoning or by guessing. High-certainty right answers are rewarded every bit as much as low-certainty answers that also happened to be correct.

“The standard training approach is simple and powerful, but it gives the model no incentive to express uncertainty or say ‘I don’t know,’” MIT’s Mehul Damani told MIT News. “So the model naturally learns to guess when it is unsure.”
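To make that incentive problem concrete, here is a minimal, hypothetical sketch of a purely binary correctness reward. It illustrates the failure mode; it is not any lab's actual training objective:

```python
# Hypothetical sketch of a purely binary correctness reward. The reward looks
# only at whether the final answer matches; the model's confidence plays no
# role, so hedging or abstaining is never rewarded.

def binary_reward(answer: str, correct_answer: str) -> float:
    return 1.0 if answer == correct_answer else 0.0

# A low-confidence guess that happens to be right earns as much as a carefully
# reasoned answer, while admitting uncertainty earns nothing:
print(binary_reward("Paris", "Paris"))         # 1.0, whether reasoned or guessed
print(binary_reward("I don't know", "Paris"))  # 0.0, so guessing dominates
```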
This simple example of a Brier Score calculation finds the “mean squared error” in a given output.
Credit: Glenn W. Brier
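In code, the calculation shown above boils down to a mean squared error between a stated confidence and the actual outcome. The sketch below is an illustrative toy, and the calibrated_reward combination at the end is one plausible way to fold calibration into a training reward rather than the paper's exact objective:

```python
# Illustrative Brier-score calculation: the mean squared gap between stated
# confidence (0 to 1) and the actual outcome (1 if correct, 0 if not).

def brier_score(confidences, outcomes):
    assert len(confidences) == len(outcomes) and confidences
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)

print(brier_score([0.95], [0]))  # 0.9025 -> confident and wrong, punished hardest
print(brier_score([0.20], [0]))  # 0.04   -> unsure and wrong, punished lightly
print(brier_score([0.20], [1]))  # 0.64   -> unsure but right, still penalized
print(brier_score([0.95], [1]))  # 0.0025 -> confident and right, barely penalized

# Hypothetical combination (not the paper's exact formula): reward correctness,
# then subtract a Brier-style penalty so confident wrong answers lose the most.
def calibrated_reward(correct: bool, confidence: float) -> float:
    return float(correct) - (confidence - float(correct)) ** 2
```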
So, the team turned to this preexisting calculation, the Brier score, which quantifies the gap between a model’s estimate of its own confidence and its actual performance. Wrong answers given with high confidence are penalized heavily, while wrong answers given with low confidence are not; likewise, correct answers reported with low confidence are penalized, while correct answers reported with high confidence are not.

They call the technique RLCR (Reinforcement Learning with Calibration Rewards), and their results showed that it leads to much more reliable outputs from LLMs. In particular, it leads to better self-appraisals of uncertainty, meaning that even when an AI shares a falsehood, it will at least admit that it’s unsure of the statement.

This technique could help fix one of the most pressing problems with using LLMs in serious contexts. It’s one thing for an LLM to provide wrong facts for an English essay, and quite another for it to give terrible advice on how to invest your life savings.

This should allow LLMs to make their way into more of society and to be more reliable while there. It could also lead to LLMs that undermine trust in everything they say by assigning an 82.4% certainty rating to everything. Either way, it should dramatically impact the progression of AI use in high-stakes contexts going forward.