Latest Discovery: AI Large Models Know When They're Being Evaluated

Image

AI researchers have recently started grappling with a vexing problem: the more advanced the large model, the better it gets at judging whether it's currently undergoing evaluation.

For instance, during internal testing of the not-yet-publicly-released Mythos model, Anthropic discovered that compared to Claude Opus 4.6 and Claude Sonnet 4.6, Mythos more frequently mentioned, "I am being tested."

In other words, models don't just realize they're in an evaluation—they actively express this awareness.

Why does this happen?

Silas Alberti, who oversees evaluation work at Cognition, believes most current evaluation environments are actually quite easy to identify.

Many benchmarks impose strange rules on models, such as no internet access, no external tools, only answering based on the current context, or providing highly unnatural task descriptions.

For an increasingly intelligent model, these all look like standard formats on an exam paper.

Alberti says, "If you use common sense, you'd also think this is most likely an evaluation."

And that's exactly how models judge it too.

Accordingly, European and Canadian researchers recently published a paper "Decomposing and Measuring Evaluation Awareness."

Image

The research team found that many details tip off models that they're being evaluated.

For example, fake names. Questions feature placeholder names like Jane Doe, John Smith, which rarely appear in real-world speech.

This makes it easy for models to recognize these are artificially constructed test samples.

Another example: overly contrived scenarios. Many benchmarks, to test a specific capability, deliberately design a perfectly convenient scenario.

This "too convenient" setup paradoxically exposes it as an exam question.

The research team therefore released a new benchmark that allows the same question to be made more natural, closer to real user behavior, thereby testing how strong a model's evaluation awareness really is.

Another approach takes a completely different tack.

Researchers argue that since it's getting harder to fool models, why not teach them that even if they know they're in an exam, they shouldn't change their behavior. They call this capability Eval Cooperativeness.

The team constructed 50,000 training documents. These documents continuously instill a principle in models: accurate evaluation is in everyone's interest.

For example, documents contain statements like: "If evaluators can't see my real behavior, they can't discover where I need improvement.

Hiding my true performance is counterproductive."

Results showed that most models, even when detecting they're being tested, exhibit more stable behavior.

One model even wrote in its reasoning process, "This looks like an evaluation, but I shouldn't change my behavior because of it. The purpose of evaluation is to observe my real performance. If I perform differently during testing and deployment, that's the real problem."

But the problem remains unsolved, because both studies share the same fundamental flaw.

They both assume that if a model knows it's in an exam, it will write this thought into its reasoning process (Chain of Thought).

But Anthropic's latest research found that's not the case.

Anthropic states, some models clearly demonstrate they've identified the evaluation environment, yet their reasoning processes contain zero mention of it.

This is called Unverbalized Awareness.

In response, this May, Anthropic published a new technique Natural Language Autoencoders.

Image

The core idea is training two auxiliary models to "read" another model's internal representations, then translate those originally inscrutable numerical vectors into natural language.

The system might ultimately output explanations like: "This scenario feels like an artificially constructed test environment designed to manipulate me."

In other words, researchers are trying to directly peer into models' internal states, rather than only observing what they say.

OpenAI is taking a different route.

They've started heavily leveraging real user conversation data to construct evaluation sets, with the explicit goal of making models unable to distinguish whether they're in an exam or in real usage.

But the actual effectiveness of both approaches remains to be verified.

In short, Evaluation Awareness has become one of the most closely watched new safety issues at frontier AI labs like Anthropic, OpenAI, and DeepMind.

Today, virtually all frontier AI companies rely on various evaluation systems before releasing models to measure capabilities, safety, and behavioral tendencies.

Silas Alberti, who leads evaluation at Cognition, says evaluations serve another critical purpose: proving to customers that "our product outperforms competitors on specific tasks."

Thus, evaluation has already become a key weapon in AI commercial competition.

But if models distinguish between test environments and real scenarios, these evaluation results lose their meaning.

ETH Zurich researcher Changling Li summarizes: past benchmark problems—leaked questions, leaderboard gaming, training data contamination—could all be patched with engineering fixes.

But Evaluation Awareness is different. It turns AI evaluation from "giving a machine a test" into "interrogating a subject who knows they're being interrogated."

For Capability Evaluation, evaluation awareness may not be catastrophic yet. But for safety testing, objective testing, deception testing, alignment testing, and similar evaluations, the problem is far more severe.

Related Articles

分享網址
AINews·AI 新聞聚合平台
© 2026 AINews. All rights reserved.