KARL: Knowledge Agents via Reinforcement Learning

An open-source model fine-tuned from GLM 4.5 Air has matched Claude Opus 4.6 across six knowledge retrieval and reasoning tasks at roughly one-third of the cost: that is the headline result of KARL, the system Databricks has just released. While major LLM vendors compete on parameter counts and inference budgets, KARL uses reinforcement learning to demonstrate a more economical path: rather than letting a general-purpose model brute-force its way through search, teach the model to search efficiently.

Core Challenges Facing Knowledge Agents

The paper focuses on a class of tasks called "grounded reasoning", in which a model must retrieve information from external document collections over multiple steps and perform complex reasoning over the collected evidence. These tasks carry significant economic value in fields such as finance, law, healthcare, and manufacturing, because enterprises rely on vast amounts of private data never seen during model training.

The paper notes that, compared to mathematical or code reasoning, academic research on frontier grounded-reasoning capability is scarce. Existing "deep research" agents rely on public web knowledge and black-box search tools, so it is unclear whether their results transfer to other grounded reasoning tasks. Moreover, the search capabilities different scenarios demand vary widely: constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, process reasoning over technical documents, and so on; a system optimized for one scenario offers no guarantees on the others.

KARLBench: Unified Evaluation of Six Search Capabilities

To systematically evaluate grounded reasoning capabilities, the paper constructed the KARLBench evaluation suite, covering six tasks, each isolating a unique capability: BrowseComp-Plus (constraint-driven entity search, 830 questions), TREC-Biogen (cross-document report synthesis, 65 questions), FinanceBench (long-document tabular numerical reasoning, 150 questions), QAMPARI (exhaustive entity search, 1000 questions), FreshStack (technical document process reasoning, 203 questions), and the internally developed PMBench (fact aggregation in enterprise internal notes, 57 questions).

[Table 1: Task Capability Examples] Each dataset isolates a unique structural challenge, from constraint-driven entity search to exhaustive fact search in enterprise internal notes.

[Table 2: Dataset Statistics] The number of questions, indexed document chunks and their average token counts in each evaluation set, as well as the average number of relevant chunks and answer nuggets per question.

All tasks share a nugget-based completion evaluation framework, and agents are equipped with a single tool, vector search, to isolate the retrieval and reasoning capabilities themselves.
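The paper does not spell out its scorer, but the idea of nugget-based completion can be sketched as coverage of required answer nuggets. Real nugget matching typically uses an LLM judge; the substring check below is a simplified stand-in:

```python
def nugget_score(answer: str, nuggets: list[str]) -> float:
    """Fraction of required answer nuggets covered by the agent's answer.
    Substring matching is a toy stand-in for an LLM-judge nugget matcher."""
    if not nuggets:
        return 0.0
    hits = sum(1 for n in nuggets if n.lower() in answer.lower())
    return hits / len(nuggets)

score = nugget_score(
    "Revenue grew 12% to $4.2B in FY2023.",
    ["12%", "$4.2B", "FY2023"],
)
```

Averaging this score over a task's questions gives a completion-style metric comparable across the six benchmarks.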

Training Method: Agent-Style Data Synthesis plus Offline Reinforcement Learning

The paper's training process is divided into three core stages.

Step 1: Agent-Style Training Data Synthesis. The paper builds a two-stage pipeline. In Stage I, a synthesis agent dynamically explores the corpus through a vector search tool and generates Q&A pairs grounded in the retrieved evidence; a deduplication agent then filters out pairs that overlap with the evaluation set. In Stage II, multiple Solver Agents independently attempt the synthesized questions; based on empirical pass rates, the paper discards samples that are too easy (all solvers correct) or too hard (all solvers wrong), retaining only the medium-difficulty data that carries the richest learning signal. Finally, a Quality Filter Agent screens out ambiguous questions and incorrect annotations.
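The Stage II difficulty filter reduces to a pass-rate check over the per-solver verdicts. A minimal sketch (thresholds and data shape are assumptions, not from the paper):

```python
def filter_by_pass_rate(samples, lo=0.0, hi=1.0):
    """Keep synthesized Q&A pairs whose empirical solver pass rate lies
    strictly between the extremes: drop all-correct (too easy) and
    all-wrong (too hard). `samples` maps question id -> list of
    per-solver correctness booleans."""
    kept = {}
    for qid, results in samples.items():
        rate = sum(results) / len(results)
        if lo < rate < hi:  # medium difficulty only
            kept[qid] = rate
    return kept

kept = filter_by_pass_rate({
    "q1": [True, True, True],     # all solvers correct -> dropped
    "q2": [True, False, False],   # mixed -> kept
    "q3": [False, False, False],  # all solvers wrong -> dropped
})
```

Samples at the extremes carry no gradient signal in group-relative RL (all rollouts get the same reward), which is why the mixed band is the useful one.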

[Figure 2: Stage I Synthesis Pipeline] The Q&A generation agent explores the corpus and proposes synthetic Q&A pairs, and the deduplication agent filters out overlaps with the test data.

[Figure 3: Stage II Solver Pipeline] Multiple Solver Agents independently generate solutions, samples at both pass-rate extremes are filtered out, and the Quality Filter Agent further screens out ambiguity and errors.

Step 2: OAPL Offline Reinforcement Learning. The paper proposes OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference policy), a post-training paradigm built on large-batch iterative offline RL. The core idea: given grouped rollouts generated by a reference policy, learn the optimal policy by minimizing a least-squares regression loss with respect to the optimal advantage function. This design is off-policy by construction, so it avoids the heuristic tricks often needed when training large-scale MoE models with online GRPO, such as importance-weight clipping, data dropping, or router replay. The paper also folds the context-compression step into RL training, letting the model learn context management end to end. Experiments ran up to three rounds of iterative training.
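The paper's exact OAPL loss is not reproduced in this summary, so the sketch below is only illustrative of the general shape: group-relative advantages estimated from a prompt's rollouts, and a least-squares regression of the policy/reference log-probability ratio onto those advantages. All names and the scaling factor `beta` are assumptions:

```python
def group_advantages(rewards):
    """Group-relative advantages for one prompt's rollouts: mean-centered
    and std-normalized, a common estimate when rollouts share a prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 + 1e-8
    return [(r - mean) / std for r in rewards]

def oapl_style_loss(logp_new, logp_ref, advantages, beta=1.0):
    """Illustrative least-squares objective: regress the scaled log-prob
    ratio of the learned policy against the lagged reference policy onto
    the advantage estimate. The actual OAPL loss may differ."""
    sq = [(beta * (ln - lr) - a) ** 2
          for ln, lr, a in zip(logp_new, logp_ref, advantages)]
    return sum(sq) / len(sq)

adv = group_advantages([1.0, 0.0, 0.0, 1.0])
loss = oapl_style_loss([0.0] * 4, [0.0] * 4, adv)  # untrained: ratio = 0
```

Because the regression target comes from a fixed (lagged) reference policy's rollouts, the objective stays well-defined off-policy, which is what removes the need for clipping-style corrections.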

Step 3: Multi-Task RL. The paper selects BrowseComp-Plus (deep search) and TREC-Biogen (broad search) as in-distribution training tasks, simply merging the losses of the two tasks and balancing the number of training tokens. Compared to multi-expert distillation schemes, multi-task RL demonstrates better generalization capabilities on out-of-distribution tasks.
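The paper says only that the two tasks' losses are merged with balanced training-token counts; one plausible reading, sketched below under that assumption, is inverse-frequency weighting so each task contributes as if it had an equal token budget:

```python
def merged_multitask_loss(task_losses, task_tokens):
    """Merge per-task RL losses so each task contributes in proportion to
    a balanced token budget rather than its raw batch size. The weighting
    scheme is an assumption; the paper only states that losses are merged
    and token counts balanced."""
    total = sum(task_tokens.values())
    # inverse-frequency weights equalize effective tokens per task
    weights = {t: total / (len(task_tokens) * n) for t, n in task_tokens.items()}
    return sum(weights[t] * task_losses[t] for t in task_losses)

loss = merged_multitask_loss(
    {"browsecomp_plus": 0.8, "trec_biogen": 0.4},
    {"browsecomp_plus": 30_000, "trec_biogen": 10_000},
)
```

With this weighting, the smaller TREC-Biogen batch is upweighted (here by 2x) so the deep-search task cannot drown out the broad-search signal.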

Test-Time Compute: Parallel Thinking and Value-Guided Search

The paper explores two test-time compute (TTC) strategies. Parallel Thinking has the model generate N independent rollouts, which the same model then aggregates into a final answer. The aggregator can not only select among candidates but also synthesize multiple rollouts into a better answer: on PMBench with 5 parallel rollouts, the aggregated answer beat every single candidate in 23.7% of cases. Value-Guided Search (VGS) trains a small value model (Qwen3-4B) to predict the future success probability of partial rollouts, which guides branch selection in tree search.
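The parallel-thinking loop is structurally simple. In this toy sketch, `solve` and `aggregate` stand in for calls to the same LLM (the real system samples rollouts with temperature for diversity); the example shows why an aggregator that synthesizes, rather than selects, can beat every individual candidate:

```python
def parallel_thinking(question, solve, aggregate, n=5):
    """Generate n independent rollouts, then aggregate them with the
    same model. `solve` and `aggregate` are stand-ins for LLM calls."""
    rollouts = [solve(question, i) for i in range(n)]
    return aggregate(question, rollouts)

facts = ["a", "b", "c"]
# toy stand-in: rollout i retrieves a different 2-fact subset,
# and the aggregator synthesizes the union of all rollouts
solve = lambda q, i: {facts[i % 3], facts[(i + 1) % 3]}
aggregate = lambda q, rollouts: set().union(*rollouts)
answer = parallel_thinking("q", solve, aggregate, n=3)
```

Here no single rollout contains all three facts, but the synthesized answer does, mirroring the 23.7% of PMBench cases where aggregation beat the best candidate.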

[Figure 4: Parallel Thinking Method] Generate N responses and aggregate; the solver agent and aggregator agent use the same model.
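Value-guided search, the second TTC strategy, amounts to a beam-style tree search where a learned scorer prunes branches. A minimal sketch, with a toy value function standing in for the trained Qwen3-4B model (beam width and depth are illustrative assumptions):

```python
def value_guided_search(root, expand, value, beam=2, depth=3):
    """Beam-style tree search over partial rollouts: expand each frontier
    node, keep the `beam` branches the value model scores highest.
    `value` stands in for a trained scorer predicting future success."""
    frontier = [root]
    for _ in range(depth):
        children = [c for node in frontier for c in expand(node)]
        frontier = sorted(children, key=value, reverse=True)[:beam]
    return max(frontier, key=value)

# toy domain: partial rollouts are bit strings, value counts the 1s
best = value_guided_search("", lambda s: [s + "0", s + "1"],
                           lambda s: s.count("1"))
```

The point of the small value model is that it is far cheaper to score a partial rollout than to finish it, so compute concentrates on promising branches.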

Core Experimental Results

[Table 4: Main Results] Performance of various models on KARLBench, including single-task RL variants, multi-task RL, and different scales of parallel thinking.

The paper uses GLM 4.5 Air as the base model. Without any test-time compute, KARL already matches Claude Sonnet 4.5 at high reasoning effort. With 3 parallel rollouts, KARL surpasses Sonnet 4.6; with 10, it matches the strongest model, Opus 4.6 (KARLBench total score 67.5 vs. 67.5); 20 parallel rollouts push this further to 68.1.

[Figure 1: Cost-Quality and Latency-Quality Pareto Frontiers] KARL defines the Pareto frontier on both cost and latency dimensions.

On cost, a single KARL call is the cheapest among all models scoring above 55 (less than $0.10 per query). At Opus 4.6 quality, KARL costs about 33% less. More notably, KARL is even cheaper than its base model GLM 4.5 Air while scoring more than 6 points higher: RL taught the model a more efficient search strategy that completes tasks with fewer steps and less token overhead. On latency, KARL is about 47% faster when matched to Opus 4.6 quality.

What Exactly Did RL Teach the Model?

The paper analyzes in depth how RL training changes model behavior. On the BrowseComp-Plus synthetic data, trajectories became markedly shorter after RL training: the average number of steps on solved problems dropped from 51.1 to 36.3. At the same time, search diversity (cumulative unique documents retrieved) increased by 37%.

[Figure 19: Search Efficiency Improvement] On 87 questions where all three models achieved perfect recall, RL training reduced unnecessary post-retrieval searches from 134.0 to 56.5, while accuracy improved from 53% to 71%.

On the question of whether RL merely "sharpens" capabilities the base model already has, the paper offers clear evidence: max@k improves with training iterations at every value of k. The trained model's max@1 reaches the base model's max@8, and its max@2 surpasses the base model's max@16, meaning the trained model solves in two attempts problems the base model cannot solve in sixteen.
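For reference, max@k counts a question as solved if any of k attempts succeeds, averaged over questions. A common unbiased estimator (an assumption about the paper's exact computation) averages over all k-subsets of the recorded attempts:

```python
import itertools

def max_at_k(per_question_attempts, k):
    """max@k: a question is solved if any of k attempts succeeds.
    Computed exhaustively over all k-subsets of the recorded attempts,
    then averaged over questions (unbiased for small attempt counts)."""
    solved = 0.0
    for attempts in per_question_attempts:
        combos = list(itertools.combinations(attempts, k))
        solved += sum(any(c) for c in combos) / len(combos)
    return solved / len(per_question_attempts)

attempts = [[True, False], [False, False]]
m1 = max_at_k(attempts, 1)  # single-attempt success rate
m2 = max_at_k(attempts, 2)  # best-of-two success rate
```

Rising max@k at large k is the key signal here: it means coverage of solvable problems grows, not just that the single best attempt gets luckier.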

[Figure 10: Test-Time Compute Scaling] Training continuously improves Max@K rather than just improving Max@1, indicating that RL expands the model's problem-solving coverage.

Limitations and Outlook

At present the agents use only a single tool, vector search; this could later be extended to structured retrieval, code execution, and composable sub-agents. Context management currently relies on simple prompt compression and could be improved with finer-grained hierarchical memory management. In addition, in scenarios that require numerical calculation, the model tends to keep searching for pre-computed results rather than reasoning over the evidence it already has; addressing this shortcoming will require arithmetic and tabular-reasoning rewards.

As the large model competition enters the agent era, KARL's results suggest an important direction: carefully designed synthetic data plus multi-task reinforcement learning may be more effective than simply scaling up model size in pushing the Pareto frontier of knowledge agents.

Original Title: KARL: Knowledge Agents via Reinforcement Learning

Original Link: https://arxiv.org/abs/2603.05218

