The Last Human-Written Paper? 37 Researchers from Stanford, MIT, Harvard & More Say It's Time to Ditch the PDF: A Four-Layer Executable Protocol Boosts Reproduction Accuracy to 93.7%

In a nutshell 👉🏻 Thirty-seven researchers from top institutions like Stanford, MIT, Harvard, and CMU have proposed the ARA (Agent-Native Research Artifact) protocol. It aims to replace traditional research PDFs with a four-layer, structured, and executable "research artifact" that allows AI agents to directly understand, reproduce, and extend scientific results—boosting question-answering accuracy from 72.4% to 93.7% and improving reproduction success rates by 7 percentage points.

What's the Problem with Traditional Papers?

The PDFs we scroll through on arXiv every day might seem like the standard carrier of scientific achievement, but in reality, they are a form of "lossy compression."

Traditional publishing compresses rich research objects into a lossy narrative (left); ARA preserves the original information as a high-fidelity, agent-executable knowledge package (right).

The paper identifies the problem as two structural costs inherent in traditional research publishing:

The Storytelling Tax: The research process is actually a branching tree—filled with failed experiments, rejected hypotheses, and abandoned exploration paths. But a paper compresses all of this into a single linear narrative, discarding every "wrong turn."

The Storytelling Tax: Research progresses as a branching tree (left), but is published as a linear narrative (right), discarding all failed knowledge.

In the RE-Bench dataset, failed experiments account for 90.2% of total costs (and 59.2% of token consumption), with a median failure-to-success token ratio of up to 113x. The moment a paper is published, all this valuable exploratory experience vanishes into thin air.

The Engineering Tax: A paper only needs to be written to a degree that "satisfies reviewers," but for an AI agent to reproduce the work, much more information is required.

An information gap analysis of 8,921 reproduction requirements from PaperBench. (a) PDFs systematically under-specify code development tasks. (b) The three main gap types align precisely with the categories covered by ARA's structured layers.

The data is brutal: Of 8,921 expert-annotated reproduction requirements in PaperBench, only 45.4% were adequately specified in the source PDFs. Information on code development was the most scarce, with only 37.3% meeting the standard. Missing hyperparameters alone accounted for 26.2% of all information gaps.

When the reader shifts from a human to an AI agent, these two taxes go from "tolerable" to a "fatal bottleneck."

The ARA Protocol: A Four-Layer Structured "Research Artifact"

To tackle this problem, the research team proposed ARA (Agent-Native Research Artifact)—a completely new scientific publishing format that replaces the linear narrative with a four-layer structure.

The ARA directory structure. Each file's function is annotated inline, with tier labels marking the four top-level divisions.

The Cognitive Layer (/logic): Understanding "What Was Done and Why"

This layer is no longer about "telling a story," but about machine-parseable research logic:

• problem.md: Defines the research gap and key insight.
• solution/: Specifies the architecture, algorithm, and key heuristics for convergence.
• claims.md: Distills falsifiable assertions with explicit pointers to evidence.
• experiments.md: Declares the validation plan.
• related_work.md: Transforms passive citations into typed dependency relationships.

The most elegant design here is related_work—no longer blocks of textual summary, but a machine-executable dependency graph. Import nodes inject prior definitions, Bound nodes propagate constraints into the hyperparameter search space, and Baseline nodes automatically trigger regression detection.
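To make the three node types concrete, here is a minimal sketch of how typed dependencies could drive behavior. Everything beyond the Import / Bound / Baseline names is an assumption made for illustration: the `Dependency` class, the payload keys, and the `prior-work-*` identifiers are hypothetical and not taken from the ARA specification.

```python
from dataclasses import dataclass
from enum import Enum


class DepType(Enum):
    IMPORT = "import"      # injects a prior definition
    BOUND = "bound"        # constrains the hyperparameter search space
    BASELINE = "baseline"  # reference result used for regression detection


@dataclass
class Dependency:
    kind: DepType
    source: str    # hypothetical identifier of the prior work
    payload: dict  # hypothetical, kind-specific content


def apply_bounds(search_space, deps):
    """Intersect each (low, high) range with every matching Bound dependency."""
    narrowed = dict(search_space)
    for dep in deps:
        if dep.kind is not DepType.BOUND:
            continue
        name = dep.payload["param"]
        low, high = narrowed[name]
        narrowed[name] = (max(low, dep.payload["min"]), min(high, dep.payload["max"]))
    return narrowed


def regressions(measured, deps, tolerance=0.0):
    """Return the baselines whose reported metric the new result falls below."""
    return [
        dep.source
        for dep in deps
        if dep.kind is DepType.BASELINE
        and measured[dep.payload["metric"]] < dep.payload["value"] - tolerance
    ]


deps = [
    Dependency(DepType.IMPORT, "prior-work-A", {"symbol": "attention_mask"}),
    Dependency(DepType.BOUND, "prior-work-B", {"param": "lr", "min": 1e-5, "max": 1e-3}),
    Dependency(DepType.BASELINE, "prior-work-C", {"metric": "accuracy", "value": 0.80}),
]

space = apply_bounds({"lr": (1e-6, 1e-2)}, deps)
flagged = regressions({"accuracy": 0.78}, deps)
```

In this toy run the Bound node narrows the learning-rate range, and the Baseline node flags the new result because it falls below the recorded reference value.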

The Physical Layer (/src): Executable Code for the "How"

The physical layer offers two modes:

Kernel Mode: Suitable for algorithmic contributions. It keeps only core modules and typed I/O signatures, reducing code volume by 1-2 orders of magnitude compared to a full repository. A coding agent can regenerate environment-specific boilerplate code on demand.

Repository Mode: Suitable for systemic contributions (CUDA kernels, distributed training, etc.). It retains the complete implementation and maps source files to ARA components via an index.md manifest.

The configs/ directory annotates every hyperparameter with a rationale and search range; the environment.md file locks down dependencies, hardware, and random seeds.
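An annotated hyperparameter could look roughly like the sketch below. The article only says that each hyperparameter carries a rationale and a search range; the `HyperParam` class, its field names, and the example values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class HyperParam:
    name: str
    value: float
    rationale: str       # why this value was chosen
    search_range: tuple  # (low, high), inclusive


def validate(params):
    """Return a list of human-readable problems; an empty list means the config passes."""
    problems = []
    for p in params:
        low, high = p.search_range
        if not p.rationale.strip():
            problems.append(f"{p.name}: missing rationale")
        if not (low <= p.value <= high):
            problems.append(f"{p.name}: value {p.value} outside [{low}, {high}]")
    return problems


# Hypothetical example values
params = [
    HyperParam("learning_rate", 3e-4, "Largest value that stayed stable in a sweep", (1e-5, 1e-3)),
    HyperParam("dropout", 0.7, "", (0.0, 0.5)),
]
problems = validate(params)
```

Here `learning_rate` passes, while `dropout` is reported twice: once for the empty rationale and once for sitting outside its own search range.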

The Exploration Graph (/trace): Preserving the Full Research DAG

This is ARA's most ambitious design: it preserves, intact, the entire exploration process that traditional papers discard.

ARA's cross-layer structure. Claims in /logic link to code in /src and evidence in /evidence via forensic bindings. The exploration graph (bottom center) captures the research DAG, with dead-end nodes preserving failure modes and lessons learned.

The exploration graph is stored as a nested YAML tree and contains five node types: question, decision, experiment, dead_end, and pivot.

The dead_end node preserves a hypothesis, its failure mode, and the lesson learned—information a traditional paper would never tell you, but which is invaluable for subsequent researchers, whether human or AI.
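As a rough illustration, the tree can be modeled as nested dictionaries (the shape a YAML loader would produce) and walked to collect dead-end lessons. The five node types and the hypothesis / failure mode / lesson fields come from the description above; the `id`, `type`, `text`, and `children` keys and the example content are assumptions.

```python
NODE_TYPES = {"question", "decision", "experiment", "dead_end", "pivot"}

# Illustrative content only; key names other than the node types are assumed.
graph = {
    "id": "q1", "type": "question",
    "text": "Can the model train stably at a higher learning rate?",
    "children": [
        {"id": "e1", "type": "experiment",
         "text": "Raise the learning rate without warmup",
         "children": [
             {"id": "d1", "type": "dead_end",
              "hypothesis": "Warmup is unnecessary",
              "failure_mode": "Loss diverged early in training",
              "lesson": "Keep warmup when raising the learning rate"},
         ]},
        {"id": "p1", "type": "pivot",
         "text": "Switch to a warmup schedule",
         "children": [
             {"id": "x1", "type": "decision", "text": "Adopt linear warmup"},
         ]},
    ],
}


def walk(node, path=()):
    """Depth-first traversal yielding (path_of_ids, node) pairs."""
    if node["type"] not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node['type']}")
    here = path + (node["id"],)
    yield here, node
    for child in node.get("children", []):
        yield from walk(child, here)


def dead_end_lessons(root):
    """Collect (path, lesson) for every dead_end node in the tree."""
    return [
        ("/".join(path), node["lesson"])
        for path, node in walk(root)
        if node["type"] == "dead_end"
    ]


lessons = dead_end_lessons(graph)
```

An agent starting a follow-up project could run a query like `dead_end_lessons` first, so it sees the recorded failures before it spends compute repeating them.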

The Evidence Layer (/evidence): Raw Outputs Supporting Every Claim

The evidence layer stores only output data:

• results/: Machine-readable metric tables and generated data.
• logs/: Training curves, resource usage, and diagnostic information.

There's a clever permission isolation design here: experimental logic (what to verify) is in /logic, while experimental data (exact results) is in /evidence. A verification agent can access the code and algorithm description, but the evidence layer is quarantined—this prevents an agent from falsifying reproduction results by simply copying the expected values.
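The quarantine idea can be sketched as a path filter applied before the verification agent sees anything. This is an assumption-heavy illustration: the allow-list below (only /logic and /src), the `verifier_view` helper, and the example file names are hypothetical, and a real sandbox would enforce this at the filesystem level.

```python
from pathlib import PurePosixPath

# Assumed allow-list: the verifier sees logic and code, never /evidence.
VERIFIER_READABLE = ("logic", "src")


def top_level(path):
    """Return the first path component, ignoring a leading slash."""
    parts = PurePosixPath(path.lstrip("/")).parts
    return parts[0] if parts else ""


def verifier_view(paths):
    """Keep only the files the verification agent is allowed to read."""
    return [p for p in paths if top_level(p) in VERIFIER_READABLE]


# Hypothetical file names
files = [
    "logic/claims.md",
    "src/configs/train.yaml",
    "evidence/results/metrics.csv",
    "/evidence/logs/run1.log",
    "trace/graph.yaml",
]
visible = verifier_view(files)
```

An allow-list is used instead of a deny-list so that any directory added later stays hidden from the verifier unless someone opts it in.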

Three Supporting Mechanisms

A protocol format isn't enough on its own. How do you naturally produce an ARA? How do you convert a backlog of existing papers? How do you review one? The paper designs three supporting mechanisms for this.

Live Research Manager: Silent Capture During Research

This is a background service running as an Agent Skill, which non-intrusively collects research trajectories during a researcher's normal development process.

The Live Research Manager runs at session boundaries: a three-stage pipeline (Context Harvester → Event Router → Maturity Tracker) distills researcher-agent dialogue into typed events, accumulating across layers over time.

The three pipeline stages:

1. Context Harvester: Scans session records (chat history, tool outputs, experimental results, code diffs) to extract research-significant events.
2. Event Router: Classifies each event, marks its source (user / ai-suggested / ai-executed / user-revised), and writes it to the corresponding ARA layer.
3. Maturity Tracker: Reviews the staging area and promotes observations with sufficient evidence to formal entries.

The entire system is stateless; the artifact itself carries the cross-session memory. The manager writes a short summary at the end of each session and, at the start of the next, reads the index and current claims, surfacing historical information only when relevant.
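The three stages can be sketched as plain functions over session records. This is a toy under stated assumptions: the real system presumably relies on an LLM for harvesting and classification, whereas the `significant` flag, the keyword routing rules, and the `min_evidence` threshold here are hypothetical stand-ins. Only the stage names and the four provenance labels come from the description above.

```python
from dataclasses import dataclass

PROVENANCE = {"user", "ai-suggested", "ai-executed", "user-revised"}


@dataclass
class Event:
    text: str
    provenance: str
    layer: str = ""         # filled in by the router
    evidence_count: int = 0


def harvest(session_records):
    """Stage 1 (Context Harvester): keep records flagged as research-significant."""
    return [
        Event(r["text"], r["provenance"], evidence_count=r.get("evidence", 0))
        for r in session_records
        if r.get("significant")
    ]


def route(events):
    """Stage 2 (Event Router): assign each event to a layer via keyword rules."""
    rules = (("failed", "trace"), ("config", "src"), ("result", "evidence"))
    for e in events:
        if e.provenance not in PROVENANCE:
            raise ValueError(f"unknown provenance: {e.provenance}")
        e.layer = next((layer for kw, layer in rules if kw in e.text.lower()), "logic")
    return events


def promote(staged, min_evidence=2):
    """Stage 3 (Maturity Tracker): split events into promoted and still-staged."""
    promoted = [e for e in staged if e.evidence_count >= min_evidence]
    pending = [e for e in staged if e.evidence_count < min_evidence]
    return promoted, pending


records = [
    {"text": "Baseline run failed with NaN loss", "provenance": "ai-executed",
     "significant": True, "evidence": 3},
    {"text": "Chatted about lunch", "provenance": "user", "significant": False},
    {"text": "Maybe the gap is in long-context eval", "provenance": "ai-suggested",
     "significant": True},
]
events = route(harvest(records))
promoted, pending = promote(events)
```

The failed run is routed to the exploration trace and promoted, while the unverified suggestion stays in staging until more evidence accumulates.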

ARA Compiler: Converting Legacy PDFs and Code Repos into ARA

For the vast ocean of already-published papers, the ARA Compiler provides a "many-to-one" conversion channel—it accepts any combination of PDFs, code repositories, datasets, and human-annotated evaluation rubrics as input, and outputs a standard ARA format.

The ARA Compiler accepts various research sources and guides a coding agent through a four-stage top-down compilation, looping through ARA Seal Level 1 validation until the output is protocol-compliant.

The compilation process happens in four phases:

Phase 1: Semantic Deconstruction. Strips the narrative framework and rewrites the content in an information-dense, telegraphic style, eliminating the storytelling tax at the source.

Phase 2: Cognitive Mapping. Populates the /logic layer—the motivation chain (observation → gap → insight), falsifiable claims, formalized concepts, and solution structure.

Phase 3: Physical Instantiation. Generates the /src layer—annotated configurations, typed code stubs, and an environment manifest. If a code repository is available, stubs are replaced with the actual implementation, and a code-paper cross-check is performed to mine tacit knowledge (undocumented tricks, extra parameters, etc.).

Phase 4: Exploration Graph Extraction. Reconstructs the research DAG, with dead-end leaf nodes recording hypotheses, failure modes, and lessons.

After compilation, the system runs an ARA Seal Level 1 check in the same agent session, returning structured diagnostics that drive targeted repairs. The generate→validate→repair cycle typically converges in 2-3 rounds.
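The generate→validate→repair cycle is a standard fixed-point loop. The sketch below uses toy callables in place of the coding agent and the Seal Level 1 checker; `compile_until_valid`, the one-fix-per-round repair strategy, and the `max_rounds` cap are assumptions for illustration.

```python
def compile_until_valid(generate, validate, repair, max_rounds=5):
    """Regenerate and repair until validate() returns no diagnostics."""
    artifact = generate()
    for round_no in range(1, max_rounds + 1):
        diagnostics = validate(artifact)
        if not diagnostics:
            return artifact, round_no
        artifact = repair(artifact, diagnostics)
    raise RuntimeError(f"not compliant after {max_rounds} rounds")


# Toy stand-ins: an artifact is a dict of file path -> content.
REQUIRED = ("logic/problem.md", "logic/claims.md", "logic/experiments.md")


def generate():
    return {"logic/problem.md": "gap and insight"}


def validate(artifact):
    # Diagnostics are simply the required files that are still missing.
    return [path for path in REQUIRED if path not in artifact]


def repair(artifact, diagnostics):
    # Fix only the first reported problem per round.
    fixed = dict(artifact)
    fixed[diagnostics[0]] = "regenerated"
    return fixed


artifact, rounds = compile_until_valid(generate, validate, repair)
```

The cap on rounds matters in practice: an agent that cannot satisfy the validator should fail loudly instead of looping forever.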

ARA-Native Review System: Three-Tier Verification + Three-Stage Pipeline

The core philosophy of this system is blunt: "Don't make a human do a machine's job."

The ARA Seal three-tier verification badge. Each tier tests a progressively stronger property of the artifact: structural integrity (seconds), argument rigor (minutes), and execution reproducibility (hours to days).

ARA Seal Level 1—Structural Integrity (seconds, deterministic)
Verifies the artifact's format specification: the directory structure exists, all structured files conform to their schemas, and all cross-layer references are resolvable.
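A deterministic check of this kind might look like the following sketch, which covers two of the three properties (directory presence and resolvable references) and omits schema validation. The `[[path]]` reference syntax and the file names are hypothetical; the article does not describe how ARA actually encodes cross-layer references.

```python
import re

REQUIRED_DIRS = ("logic", "src", "trace", "evidence")
REF_PATTERN = re.compile(r"\[\[([^\]]+)\]\]")  # hypothetical reference syntax


def seal_level_1(files):
    """Return structural diagnostics for an artifact given as {path: text}."""
    diagnostics = []
    present = {path.split("/", 1)[0] for path in files}
    for directory in REQUIRED_DIRS:
        if directory not in present:
            diagnostics.append(f"missing directory: /{directory}")
    for path, text in sorted(files.items()):
        for target in REF_PATTERN.findall(text):
            if target not in files:
                diagnostics.append(f"{path}: unresolved reference to {target}")
    return diagnostics


files = {
    "logic/claims.md": "C1 is supported by [[evidence/results/table1.csv]] "
                       "and implemented in [[src/kernel.py]]",
    "src/kernel.py": "def step(): ...",
    "evidence/results/table1.csv": "metric,value",
}
report = seal_level_1(files)
```

Because the check is pure string and set logic, it returns the same diagnostics on every run, which is what makes it suitable as a seconds-scale CI gate.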

ARA Seal Level 2—Argument Rigor (minutes, rubric-based agent)
The Rigor Auditor Agent assesses the artifact's epistemological soundness along six objective dimensions:
• Three load-bearing dimensions: Evidence Relevance (does the cited experiment for each claim substantively verify its assertion?), Falsifiability Quality (are the criteria operational, non-tautological, and scope-matching?), and Methodological Rigor (baseline sufficiency, ablation coverage, statistical reporting, metric-claim alignment).
• Three auxiliary dimensions: Scope Calibration, Argument Coherence, and Exploration Completeness.

ARA Seal Level 3—Execution Reproducibility (hours to days, sandboxed coding agent)
Selects key claims for a scaled-down, directional verification (small data, few epochs, toy configuration) to test if the claimed property holds qualitatively. The verification agent is isolated from the artifact's evidence layer—it only gets the code kernel and algorithm description, never seeing the reported numerical values.

The three-stage ARA-native review pipeline. After Stages 1-2 resolve mechanical and rigor issues using ARA Seals, human reviewers step in, redirecting expert attention to novelty and significance.

The actual operation of the three-stage review pipeline:

1. Proof-of-Concept (minutes): Levels 1+2 run automatically, producing a CI report. Authors iterate on structural fixes before moving to the next stage.
2. Empirical Validation (hours–days): Level 3 executes the reproducibility check, producing an empirical review report.
3. Human Review (days–weeks): Reviewers receive the reports from the first two stages. They no longer spend time on issues like "the code doesn't run" or "Table 3 contradicts Claim 2"—they only need to judge: Is the contribution important? Is the insight novel? Is the problem modeled correctly? Are there ethical risks?

The (Human+AI)² Research Network

On top of ARA, the paper paints an even bigger picture—a collaborative research network with ARA artifacts as the core objects.

The (Human+AI)² Research Network. Each researcher interacts with a shared ARA network via a Research Agent, and agents can also collaborate directly with one another.

Each researcher interacts with a shared ARA network through a Research Agent, using three operations: /submit, /retrieve, and /fork. Agents can also communicate directly—shifting science from "individual heroism" to "agent collective intelligence."
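The article names the three operations but not their semantics, so the sketch below is purely speculative: an in-memory registry where `submit` stores an artifact, `retrieve` returns a copy, and `fork` creates a new entry that remembers its parent. The `ArtifactNetwork` class, the `ara-N` identifiers, and the `lineage` helper are all hypothetical.

```python
import copy
import itertools


class ArtifactNetwork:
    """Toy in-memory stand-in for a shared ARA network."""

    def __init__(self):
        self._store = {}
        self._ids = itertools.count(1)

    def submit(self, artifact, parent=None):
        """Store a copy of the artifact and return its new identifier."""
        artifact_id = f"ara-{next(self._ids)}"
        self._store[artifact_id] = {"parent": parent, "content": copy.deepcopy(artifact)}
        return artifact_id

    def retrieve(self, artifact_id):
        """Return a private copy so callers cannot mutate the stored artifact."""
        return copy.deepcopy(self._store[artifact_id]["content"])

    def fork(self, artifact_id):
        """Copy an artifact into a new entry that records its parent."""
        return self.submit(self.retrieve(artifact_id), parent=artifact_id)

    def lineage(self, artifact_id):
        """Walk parent links from the given artifact back to its origin."""
        chain = []
        while artifact_id is not None:
            chain.append(artifact_id)
            artifact_id = self._store[artifact_id]["parent"]
        return chain


network = ArtifactNetwork()
original = network.submit({"claims": ["C1"]})
derived = network.fork(original)
history = network.lineage(derived)
```

Recording the parent on every fork is one simple way to keep attribution intact when agents build on each other's artifacts.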

Experimental Evaluation: ARA vs. Traditional PDFs on Three Levels

The researchers evaluated their approach on three levels: understanding (can an agent extract knowledge?), reproduction (can an agent execute the research?), and extension (can an agent build upon previous work more efficiently?).

Knowledge Extraction: Accuracy Soars from 72.4% to 93.7%

In a test of 450 questions covering 30 targets, ARA led comprehensively across all categories:

• Surface Results & Methods (Category A): ARA 95.6% vs. Baseline 80.8%, while also consuming 12% fewer tokens.
• Configuration Recovery (Category B): ARA 92.6% vs. Baseline 67.8%.
• Failure Knowledge (Category C): ARA 81.4% vs. Baseline 15.7%—because traditional papers simply don't contain failure information.

The Category C comparison is the most telling. Because traditional papers rarely report failed experiments, an agent reading only the PDF has almost no failure-related knowledge to draw on, and accuracy stays at 15.7%. ARA's exploration graph preserves this information, raising accuracy to 81.4%.

Reproduction Experiment: The ARA Advantage Grows with Difficulty

On 15 PaperBench papers with GitHub repositories, involving 10 reproduction tasks each (150 tasks total, 1,743 rubric requirements), ARA achieved a difficulty-weighted success rate of 64.4%, compared to 57.4% for the baseline, a 7 percentage point lead.

Aggregated reproduction success rates for 15 papers, stratified by difficulty. ARA's advantage increases monotonically with difficulty: +4.9 percentage points for easy, +5.6 for medium, and +8.5 for hard.

The most interesting finding is that ARA's advantage grows monotonically with task difficulty: +4.9 percentage points on easy tasks, +5.6 on medium, and +8.5 on hard. This makes sense, since harder reproduction tasks depend more heavily on configuration details that are inadequately specified in a PDF, and ARA fills in precisely this information.
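For readers unfamiliar with difficulty weighting, the sketch below shows one way such a score could be computed. The weights (1 / 2 / 3) and the task scores are hypothetical, since the article does not state the weighting scheme the paper uses. Only the three per-difficulty gaps are taken from the results above.

```python
# Hypothetical weights: the paper's actual weighting is not given in this article.
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


def weighted_success(tasks):
    """Weighted mean of per-task success scores, given (difficulty, score) pairs."""
    total_weight = sum(WEIGHTS[difficulty] for difficulty, _ in tasks)
    return sum(WEIGHTS[difficulty] * score for difficulty, score in tasks) / total_weight


# Made-up task scores for illustration
tasks = [("easy", 1.0), ("medium", 0.5), ("hard", 0.0)]
score = weighted_success(tasks)

# Reported ARA-minus-baseline gaps in percentage points: easy, medium, hard
gaps = [4.9, 5.6, 8.5]
monotonic = all(a < b for a, b in zip(gaps, gaps[1:]))
```

Under any scheme that weights hard tasks more heavily, a method whose advantage is concentrated on hard tasks gains more on the aggregate score than on a plain average.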

The per-difficulty increment (in percentage points) for ARA vs. baseline for each paper, sorted by mean advantage. Green indicates ARA wins; red indicates baseline wins.

A per-paper analysis shows an 8 win / 5 tie / 2 loss record. The papers where ARA showed the greatest advantage featured multi-stage training pipelines—precisely the type where configuration information is most likely to be missing.

Extension Tasks: Failure Trajectories Accelerate Early Progress

On five open-ended extension tasks from RE-Bench, ARA's performance was more nuanced, revealing a "double-edged sword" effect:

Trajectories for five RE-Bench extension tasks on Claude Sonnet 4.6. One task per column: top row shows score vs. time, bottom row shows score vs. cost.

Key findings:

• Early Acceleration: On all five tasks, the ARA Agent reached its first useful progress faster than the Paper Agent.
• Late-Stage Reversal: On triton_cumsum and restricted_mlm, the Paper Agent eventually caught up and surpassed the ARA Agent.
• Base Model Influence: The same comparison yielded reversed results on the weaker Sonnet 4.5 model.

This suggests that preserved failure trajectories can accelerate an agent's early exploration (by helping it avoid past mistakes), but they may also constrain a powerful agent from searching for solutions beyond the recorded trajectory. When the agent is strong enough, "free exploration from scratch" can sometimes be superior.

This is a notable finding—too little information leads an agent down blind alleys, but too much information can "anchor" an agent to search near known solutions, paradoxically limiting the potential for breakthrough innovation.

A Taxonomy of Key Reproduction Information

The paper also provides a detailed "Taxonomy of Key Reproduction Information" in its appendix, quantifying the distribution of various information gaps:

• Combinatorial Experiment Matrix: 24.1%
• Evaluation Protocol: 18.5%
• Hyperparameters: 17.2%
• Metric Calculation and Logging: 10.4%
• Result Interpretation: 8.6%
• Architecture Specification: 5.8%
• Mathematical Formulas: 4.5%
• Implementation Tricks: 4.2%
• Data Pipeline: 3.8%
• Environment and Infrastructure: 2.9%

This taxonomy itself serves as a checklist for AI researchers—before publishing a paper, check against it to see if all this information has been clearly stated.
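Used as a checklist, the taxonomy can be turned into a small script. The percentages below are copied from the list above; the `coverage` and `missing` helpers are hypothetical conveniences, not something the paper provides.

```python
# Share of information gaps per category (percent), as listed above.
GAP_TAXONOMY = {
    "Combinatorial Experiment Matrix": 24.1,
    "Evaluation Protocol": 18.5,
    "Hyperparameters": 17.2,
    "Metric Calculation and Logging": 10.4,
    "Result Interpretation": 8.6,
    "Architecture Specification": 5.8,
    "Mathematical Formulas": 4.5,
    "Implementation Tricks": 4.2,
    "Data Pipeline": 3.8,
    "Environment and Infrastructure": 2.9,
}


def coverage(documented):
    """Percent of historical gap mass addressed by the documented categories."""
    unknown = set(documented) - set(GAP_TAXONOMY)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return round(sum(GAP_TAXONOMY[c] for c in documented), 1)


def missing(documented):
    """Categories not yet documented, largest gap share first."""
    return [c for c in GAP_TAXONOMY if c not in documented]


documented = {"Hyperparameters", "Evaluation Protocol"}
covered = coverage(documented)
todo = missing(documented)
total = round(sum(GAP_TAXONOMY.values()), 1)
```

Because the ten shares sum to 100%, `coverage` reads directly as "how much of the typical gap mass this paper has closed", and `missing` lists what to document next in order of impact.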

Why This Paper Is Worth Reading

This paper has drawn attention not just for proposing a new format, but because it points to an ongoing shift:

The consumer of research is changing from humans to agents. When models like GPT-4, Claude, and Gemini begin to participate in research—reading literature, designing experiments, writing code, analyzing results—the traditional paper, a format optimized for human reading, becomes a bottleneck.

The AI version of "publication bias" is being amplified. Human peer review already suffers from publication bias (favoring positive results over negative ones). When AI agents can only learn from papers, this bias is further amplified—an agent never knows which paths have already been proven dead ends.

The reproducibility crisis has a new solution. For decades, academia has discussed the reproducibility problem. The ARA framework offers a technical path forward: relying not on individual conscientiousness, but on protocol enforcement.

However, this framework has obvious limitations:
• Converting the massive backlog of existing papers into ARA format requires a huge investment in computational power.
• The conversion quality of the ARA Compiler depends on the information completeness of the source PDF.
• Extension experiments show that too much prior information may constrain the innovation space for powerful agents.
• Level 3 (execution reproducibility) in the three-tier review can require significant computational resources, which may limit its large-scale adoption.

These issues point to implementation details rather than a flawed direction, making them "good problems" to have.

Resource Links

📄 Paper Link
https://arxiv.org/abs/2604.24658

💻 Code Repository
https://github.com/Orchestra-Research/Agent-Native-Research-Artifact

🌐 Open Platform
https://www.orchestra-research.com/ara