As large language models evolve from simple text Q&A to complex multi-step tasks, the traditional LLM-as-a-Judge evaluation method exposes fundamental flaws: single-inference evaluation is prone to bias, cannot verify correctness in professional domains, and suffers from cognitive overload in multi-dimensional assessments. This survey from institutions including The Hong Kong Polytechnic University is the first to systematically organize the emerging Agent-as-a-Judge paradigm, charting the evolution of AI evaluation from "passive observation" to "active verification."
The Three Dilemmas of Traditional LLM Judges
The paper points out that as generative AI applications evolve from simple text responses to complex multi-step tasks across professional domains, the reliability of LLM-as-a-Judge faces fundamental challenges.
First, inherent parameter bias. A single evaluator tends to favor verbose answers or its own output patterns, compromising neutrality when evaluating high-complexity responses that deviate from the training distribution.
Second, the limitation of passive observation. A naive LLM judge cannot react to real-world observations; it evaluates answers based solely on language patterns without verification, leading to "hallucinatory evaluation" in professional domains.
Third, the cognitive overload problem. In tasks requiring multi-dimensional evaluation criteria, traditional LLM judges attempt to comprehensively assess all dimensions in a single inference, resulting in coarse-grained scores that fail to reflect specific nuances.
[Figure 1: Comparison of LLM-as-a-Judge and Agent-as-a-Judge] The paper uses a comparison diagram to illustrate the core differences between the two paradigms: the former performs direct single-inference evaluation, while the latter leverages planning, memory, and tool-enhanced capabilities to achieve enhanced evaluation.
The Triple Evolution of Agent Judges
The paper analyzes the paradigm shift from LLM-as-a-Judge to Agent-as-a-Judge from three dimensions.
Evolution of Robustness: From Monolithic to Decentralized. To mitigate the inherent parameter bias of monolithic LLM judges, Agent-as-a-Judge employs specialized decentralized agents that collaborate through autonomous decision-making. This decentralized architecture facilitates the injection of expert prior knowledge: by decomposing complex evaluation objectives into sub-tasks or constructing specific interaction workflows, domain-specific constraints that are typically overlooked by general models can be enforced. Multi-agent deliberation ensures collective robustness, as different roles can isolate specific information points to neutralize biases.
Evolution of Verification: From Intuition to Execution. Static LLM judges are essentially passive observers, unable to react to real-world feedback. Agent-as-a-Judge bridges this reality gap by replacing intuition with execution. By interacting with the external environment, agentic judges can query system states to verify side effects, use code interpreters or theorem provers to verify logical consistency, and use search tools to anchor factual claims in real-time documents.
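The "intuition to execution" shift can be made concrete with a minimal sketch. The code below is a hypothetical illustration (not from the paper): instead of scoring a candidate code answer by reading it, the judge executes it against test cases and grounds its verdict in the observed outcome. The `solve` convention and the verdict format are assumptions for illustration.

```python
# Hypothetical sketch: an execution-based verdict for code answers.
# The judge runs the candidate instead of judging its text by intuition.

def execute_based_verdict(candidate_src: str, tests: list) -> dict:
    """Compile a candidate defining `solve`, then check (args, expected) pairs."""
    namespace: dict = {}
    try:
        # In a real system this would run inside a sandboxed interpreter.
        exec(candidate_src, namespace)
    except Exception as e:
        return {"verdict": "fail", "reason": f"does not compile: {e}"}
    solve = namespace.get("solve")
    if solve is None:
        return {"verdict": "fail", "reason": "no `solve` function defined"}
    for args, expected in tests:
        try:
            got = solve(*args)
        except Exception as e:
            return {"verdict": "fail", "reason": f"raised {e!r} on {args}"}
        if got != expected:
            return {"verdict": "fail",
                    "reason": f"solve{args} = {got!r}, expected {expected!r}"}
    return {"verdict": "pass", "reason": f"all {len(tests)} checks passed"}

candidate = "def solve(a, b):\n    return a + b\n"
print(execute_based_verdict(candidate, [((1, 2), 3), ((0, 0), 0)]))
```

The same pattern generalizes to the paper's other verification tools: swap the interpreter for a theorem prover to check logical consistency, or a search tool to anchor factual claims.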
Evolution of Granularity: From Global to Fine-Grained. Agent-as-a-Judge addresses the cognitive overload problem by transforming evaluation from single-inference to autonomous hierarchical reasoning. Agentic judges can dynamically select or create task-specific evaluation standards, autonomously plan evaluations to independently check each component of the evaluation object, use memory to track historical reasoning states, and synthesize fragmented evidence into coherent judgments.
Three-Stage Development Roadmap
The paper summarizes the development of Agent-as-a-Judge into three progressive stages.
Procedural Stage: Decomposing monolithic inference into predefined agentic workflows, or structured discussions between fixed sub-agents. These systems achieve complex judgments through coordinated multi-agent interactions but are limited by predetermined decision rules that cannot adapt to new evaluation scenarios.
Reactive Stage: Routing execution paths based on intermediate feedback and invoking external tools or sub-agents to achieve adaptive decision-making. However, this reactivity is still limited to conditional routing within a fixed decision space, lacking the autonomy to improve underlying evaluation standards.
Self-Evolving Stage: Representing the frontier of the field, characterized by high autonomy and the ability to improve internal components during runtime—immediately synthesizing evaluation standards and updating memory with learned experiences.
[Figure 2: Agent-as-a-Judge Classification System] The paper constructs a complete classification system, organizing methodologies and application areas, with a background gradient showing the coverage of development stages from procedural to reactive to self-evolving.
Five Core Methodologies
The paper categorizes Agent-as-a-Judge methodologies into five dimensions.
Multi-Agent Collaboration includes two topologies: Collective Consensus uses horizontal debate mechanisms with agents representing different perspectives to counteract the inherent biases of a single LLM evaluator; Task Decomposition employs a "divide and conquer" strategy, delegating different sub-tasks to specialized agents for systematic evaluation.
[Figure 3: Multi-Agent Collaboration Paradigm] The paper demonstrates the specific implementation methods for the two collaboration topologies: collective consensus and task decomposition.
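The two topologies can be sketched in a few lines. This is a hypothetical illustration, not an implementation from any surveyed system: each "agent" is stubbed as a plain function returning a verdict, where a real system would wrap an LLM call with a role-specific prompt.

```python
from collections import Counter
from typing import Callable

Judge = Callable[[str], str]  # answer -> "accept" | "reject"

def collective_consensus(answer: str, judges: list) -> str:
    """Horizontal topology: independent judges vote, majority verdict wins."""
    votes = Counter(j(answer) for j in judges)
    return votes.most_common(1)[0][0]

def task_decomposition(answer: str, subtask_judges: dict) -> dict:
    """Divide-and-conquer topology: each specialist judges one sub-dimension."""
    return {dim: judge(answer) for dim, judge in subtask_judges.items()}

# Stub agents with fixed personas, purely for illustration.
lenient = lambda a: "accept"
strict = lambda a: "reject" if len(a) < 10 else "accept"
factual = lambda a: "accept" if "2+2=4" in a else "reject"

answer = "Because 2+2=4, the claim holds."
print(collective_consensus(answer, [lenient, strict, factual]))
print(task_decomposition(answer, {"style": lenient, "facts": factual}))
```

The consensus path counters single-judge bias by averaging over perspectives; the decomposition path counters cognitive overload by narrowing each agent's scope.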
Planning Capability is reflected in two aspects: workflow orchestration evolves from static decomposition to dynamic multi-round planning; evaluation standard discovery enables the judge agent to autonomously formulate and improve evaluation standards, which is a hallmark capability of the self-evolving stage.
Tool Integration is a defining capability of Agent-as-a-Judge. The paper categorizes it into two types of uses: evidence collection (e.g., code execution feedback, visual model signals) and correctness verification (e.g., theorem provers, search engines, Python interpreters).
[Table 1: Tool Integration in Representative Agent-as-a-Judge Methods] The paper classifies representative methods by primary tool usage, covering systems such as Agent-as-a-Judge, HERMES, VerifiAgent, and Agentic RM.
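A minimal sketch of tool routing, under assumptions not found in the paper (the claim schema, tool names, and toy knowledge base are all illustrative): the judge dispatches each claim to a matching tool, using a restricted `eval` as a stand-in for a sandboxed interpreter and a dictionary lookup as a stand-in for a search tool.

```python
# Hypothetical sketch of a tool-integrated judge: route each claim to a
# tool, then ground the verdict in the tool's observation.

def run_python(expr: str) -> str:
    """Correctness verification: evaluate an arithmetic claim directly."""
    return str(eval(expr, {"__builtins__": {}}))  # stand-in for a sandbox

def lookup(fact: str, kb: dict) -> str:
    """Evidence collection: toy knowledge base standing in for a search tool."""
    return kb.get(fact, "no evidence found")

def judge_claim(claim: dict, kb: dict) -> bool:
    if claim["type"] == "arithmetic":
        return run_python(claim["expr"]) == claim["asserted"]
    if claim["type"] == "factual":
        return lookup(claim["subject"], kb) == claim["asserted"]
    raise ValueError(f"no tool registered for claim type {claim['type']!r}")

kb = {"speed of light (km/s)": "299792"}
print(judge_claim({"type": "arithmetic", "expr": "17 * 3",
                   "asserted": "51"}, kb))   # True
print(judge_claim({"type": "factual", "subject": "speed of light (km/s)",
                   "asserted": "299792"}, kb))  # True
```

The two branches mirror the paper's split between correctness verification (interpreters, provers) and evidence collection (search, execution feedback).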
Memory and Personalization support multi-step reasoning and consistent judgment: intermediate state memory retains intermediate states generated during the evaluation process, providing necessary context for conditional routing in reactive Agent-as-a-Judge; personalized context memory retains user-related information to regulate evaluation during interactions.
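The two memory roles described above can be sketched as follows; the structure, field names, and routing rule are hypothetical illustrations rather than anything specified in the survey.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeMemory:
    steps: list = field(default_factory=list)          # intermediate reasoning states
    user_profile: dict = field(default_factory=dict)   # personalized context

    def record(self, step: str, outcome: str) -> None:
        self.steps.append({"step": step, "outcome": outcome})

    def any_failed(self) -> bool:
        """Conditional routing: escalate when any recorded step failed."""
        return any(s["outcome"] == "fail" for s in self.steps)

mem = JudgeMemory(user_profile={"preferred_style": "concise"})
mem.record("check_syntax", "pass")
mem.record("check_facts", "fail")
# A reactive judge would route to a verification tool at this point:
print("escalate" if mem.any_failed() else "finalize")  # escalate
print(mem.user_profile["preferred_style"])             # concise
```

The `steps` list supplies the context for conditional routing in reactive judges, while `user_profile` carries the user-related information that personalized evaluation draws on across interactions.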
Optimization Paradigms are divided into training-time optimization (updating model parameters via supervised fine-tuning and reinforcement learning) and inference-time optimization (controlling judgment generation methods via prompts, workflows, or agent interactions).
Wide Range of Application Domains
[Figure 4: Overview of Agent-as-a-Judge Application Domains] The paper shows fine-grained task categories in general and professional domains.
In general domains, Agent-as-a-Judge has been applied to mathematics and code evaluation (e.g., HERMES anchors reasoning through formal proof steps), fact-checking (e.g., FACT-AUDIT models fact-checking as an agentic loop of multi-agent collaboration), dialogue and interaction evaluation, and multimodal and visual evaluation.
In professional domains, the paper organizes applications in medicine (e.g., MAJ-Eval constructs multi-evaluator roles for debate and cross-validation), law (e.g., AgentsCourt introduces an adversarial debate framework), finance (e.g., FinResearchBench extracts logical trees from reports as intermediate structures), and education (e.g., Grade-Like-Human decomposes scoring into a staged process).
Challenges and Future Directions
The paper identifies four major challenges facing Agent-as-a-Judge: computational cost (more expensive for both training and inference), latency (sequential inference steps and external tool calls introduce additional delays), security (tool-enhanced judges may access external systems, expanding the attack surface), and privacy (persistent memory or personalized evaluation may increase the risk of sensitive data leakage).
Future directions include: personalization (proactively managing the lifecycle of user-specific knowledge), generalization (dynamically discovering and adapting evaluation standards), interactivity (evolving from passive observers to evaluators that actively interact with the environment and human stakeholders), and optimization (shifting from inference-time engineering to training-based optimization).
The core insight of the paper is that the next-generation judge agent must transcend fixed protocols and become truly autonomous entities capable of self-directed adaptation, proactive context management, and continuous self-improvement, ultimately realizing the full potential to co-perceive, co-reason, and co-evolve with the models being evaluated.
Paper Title: A Survey on Agent-as-a-Judge
Paper Link: https://arxiv.org/pdf/2601.05111