Author | BAAI (Beijing Academy of Artificial Intelligence)
DeepXiv is a scientific literature infrastructure designed specifically for AI agents, transforming paper search, progressive reading, trend tracking, and deep research into callable, orchestratable, and automatable capabilities. It does not merely port paper websites to the command line; instead, it converts scientific literature itself into data interfaces and skill systems that agents can directly consume. DeepXiv is jointly developed by BAAI, universities, and community developers. The project is now open-source and free to use.
Resource Links:
GitHub: https://github.com/DeepXiv/deepxiv_sdk
PyPI: https://pypi.org/project/deepxiv-sdk/
API Docs: https://data.rag.ac.cn/api/docs
Technical Report: https://arxiv.org/abs/2603.00084
Introduction
With the rapid development of large model agents, AI-driven Autonomous Research is moving quickly from concept to reality. From automatically discovering scientific problems and generating research plans to designing theoretical methods and conducting experimental explorations, research agents are fundamentally reshaping the paradigm of scientific research across the entire process.
However, to truly serve scientific research, a fundamental technical bottleneck must be addressed: How can agents efficiently utilize scientific literature?
BAAI identified this core pain point early on: today, scientific literature is still packaged for human users. In the traditional model, agents must rely on cumbersome internet searches and web parsing to retrieve relevant papers, and then need complex reading tools to extract useful information from highly visual PDFs.
This infrastructure, based on Search Engines and Graphical User Interfaces (GUI), is highly incompatible with the working methods of agents, severely constraining their effectiveness and execution efficiency.
In other words, while we possess massive amounts of open scientific literature, we lack a "scientific literature infrastructure" oriented towards agents.
- If papers in the past were merely "for humans to read," now they need to cater to the new demand of being "for agents to read."
An effective approach is: Make papers into a CLI (Command Line Interface), allowing agents to easily access and utilize them.
Therefore, BAAI, in collaboration with universities and the open-source community, has tackled this challenge by proposing the core concept of adapting papers for CLI interaction and building dedicated literature infrastructure. This bridges the gap between massive open papers and agents, laying a solid foundational base for automated research.
DeepXiv
DeepXiv is a comprehensive toolkit for scientific literature oriented towards agents. Its goal is to upgrade open scientific literature from "human-readable" to "agent-usable."
To achieve this, DeepXiv provides three core capabilities.
Data Access: Turning Open Scientific Literature into "Data Consumable by Agents"
DeepXiv exposes paper data in agent-friendly formats, with native support for JSON and Markdown. Papers become directly readable and usable, freeing agents from having to "struggle to scrape information" out of complex PDF and HTML files. Agents can also directly access metadata such as titles, authors, abstracts, and references, making papers far more convenient to work with.
Simultaneously, for agents, the real test is not just how to obtain information, but how to utilize it precisely within limited context windows and reasoning budgets. Addressing this, DeepXiv offers data organization methods optimized for agents. At the Preview level, DeepXiv quickly retrieves core paper information to judge relevance at low cost. Through Chunking, it splits paper content by structure or semantics to support intensive reading of specific parts. During the overall reading process, DeepXiv implements Progressive Disclosure: viewing a small amount first, then expanding as needed, avoiding the injection of entire long texts at once.
The value of these designs is direct: reducing token consumption, improving retrieval and reading efficiency, and supporting complex multi-step research tasks, allowing agents to focus on truly valuable information.
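To make the budget-allocation idea concrete, here is a minimal sketch (illustrative only, not the official SDK API) of how an agent might decide which sections to expand given a token budget. It assumes the JSON shape returned by `--head` (a `sections` list whose items carry `name` and `token_count`, as in the example output shown below):

```python
# Illustrative planner: given --head output, expand the highest-priority
# sections that still fit in the remaining token budget.

def plan_reading(head: dict, budget: int,
                 priority: tuple = ("experiment", "result", "method")) -> list:
    """Return section names to expand, highest priority first, within budget."""
    def score(section):
        name = section["name"].lower()
        for rank, key in enumerate(priority):
            if key in name:
                return rank
        return len(priority)  # everything else comes last

    plan, spent = [], 0
    for sec in sorted(head["sections"], key=score):
        if spent + sec["token_count"] <= budget:
            plan.append(sec["name"])
            spent += sec["token_count"]
    return plan

# Sample data mimicking the --head section list
head = {
    "sections": [
        {"name": "Introduction", "token_count": 1098},
        {"name": "Method", "token_count": 2400},
        {"name": "Experiments", "token_count": 3100},
        {"name": "Related Work", "token_count": 1500},
    ]
}
print(plan_reading(head, budget=6000))  # -> ['Experiments', 'Method']
```

The point is not this particular heuristic but the shape of the decision: because `--head` reports per-section token counts, the agent can trade context budget for information value before reading a single full section.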
This is not merely a conceptual design but one that can be directly implemented in specific invocation methods. Facing a new research topic, the most natural action for an agent is not to read the entire paper at once, but to first search for candidate literature, quickly judge whether it is worth investing more context budget, and finally expand only the truly critical parts. For example:
pip install deepxiv-sdk # Install the toolkit
deepxiv search "agent memory" # Search research topic
deepxiv paper 2602.16493 --brief # Quickly view abstract and key points
deepxiv paper 2602.16493 --head # View structure and chapter distribution
deepxiv paper 2602.16493 --section "Experiments" # Read only the Experiments section
This set of commands corresponds to a literature utilization path very close to the real research process:
- search finds candidate papers first.
- --brief previews core paper information, judging a paper's value at extremely low cost.
- --head helps the agent grasp the full-text structure and chapter distribution.
- --section lets the agent read only the most valuable content on demand, such as Introduction, Method, or Experiments.

The result is not simply "reading less," but giving the agent the ability to allocate its token budget according to information value.
The paper content returned by DeepXiv is parsed Markdown or JSON, making it stress-free for agents to read! Below are example returns for the --brief and --head commands.
# Example return for command: deepxiv paper 2602.16493 --brief
📄 MMA: Multimodal Memory Agent
🆔 arXiv: 2602.16493
📅 Published: 2026-02-18T00:00:00
📊 Citations: 0
🔗 PDF: https://arxiv.org/pdf/2602.16493
💻 GitHub: https://github.com/AIGeeksGroup/MMA
🏷️ Keywords: memory-level reliability, temporal decay, conflict-aware consensus, epistemic prudence, visual placebo effect
💡 TLDR:
[research paper] MMA introduces a memory-level reliability framework that dynamically scores retrieved items using source credibility, temporal decay, and conflict-aware network consensus to mitigate overconfidence from stale or inconsistent memories. It reveals the 'Visual Placebo Effect'—where RAG agents generate unwarranted certainty from ambiguous visual inputs due to latent biases in foundation models—and demonstrates superior performance on FEVER (35.2% lower variance), LoCoMo (higher actionable accuracy, fewer wrong answers), and MMA-Bench (41.18% Type-B accuracy vs. 0.0% baseline) under epistemic-aware evaluation protocols that reward abstention and penalize overconfidence.
// Example return for command: deepxiv paper 2602.16493 --head
{
"arxiv_id": "2602.16493",
"title": "MMA: Multimodal Memory Agent",
"abstract": "Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the \"Visual Placebo Effect\", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.",
"authors": [
{
"misc": {},
"name": "Yihao Lu",
"orgs": ["School of Computer Science, Peking University"]
}
// ...
],
"token_count": 17386,
// ...
"sections": [
{
"name": "Introduction",
"idx": 0,
"tldr": "MMA introduces a memory-level confidence scoring framework that uses source credibility, temporal decay, and conflict-aware consensus to prioritize reliable memories and prevent retrieval traps, while introducing an incentive-aligned benchmark that rewards epistemic prudence and calibrated abstention.",
"token_count": 1098
}
// ...
],
"categories": ["cs.CV"],
"publish_at": "2026-02-18T00:00:00",
"keywords": ["memory-level reliability", "temporal decay", "conflict-aware consensus"],
"tldr": "[research paper] MMA introduces a memory-level reliability framework...",
"github_url": "https://github.com/AIGeeksGroup/MMA"
}
DeepXiv already covers the full ArXiv corpus and maintains daily incremental updates.
Meanwhile, DeepXiv is rapidly expanding to more open literature sources, including PubMed Central (PMC), ACM, bioRxiv / medRxiv / ChemRxiv, and Semantic Scholar, ultimately establishing a unified agent access layer covering over 200 million open scientific papers.
This expansion does not stop at data ingestion: every new source is served through the same unified, agent-oriented interface. For example, in the PMC scenario, an agent retrieves paper content with the same command pattern:
deepxiv pmc PMC544940 --head # View full text structure
deepxiv pmc PMC544940 # View full text as JSON
This means that as more open literature sources are connected, agents will not face a fragmented set of interfaces with differing invocation methods. Instead, they get one reusable, transferable, and automatable way of utilizing literature. In other words, whether the source is ArXiv, PMC, or more *Rxiv and OA repositories in the future, capabilities are exposed to agents as consistently as possible.
One-Stop Capability Integration: Not Just Retrieval, But "Helping Agents Do Things"
DeepXiv has built its own dedicated paper search engine, providing optimized search results and configurable search modes. Of course, simply finding papers is far from enough; on top of search, DeepXiv has developed richer skills. For Q&A, DeepXiv can complete information extraction and understanding around the literature, answering questions such as "What is the core contribution of the paper?" or "What are the experimental settings and comparison baselines?" to achieve deep understanding of a paper. It can also track hotspots, identifying which papers on a topic are trending daily, weekly, or monthly. For complex questions, DeepXiv conducts in-depth research, e.g. "What are the representative works on Agent Memory in the past three years?" or "What public benchmarks and datasets exist for multimodal retrieval augmentation in financial scenarios?"
The DeepXiv skill pack is continuously expanding, and agents can perceive and flexibly invoke these through its built-in Skills and the command line --help mechanism.
This characteristic of "not just retrieval, but invoking capabilities around tasks" becomes more obvious in actual use. For instance, a typical hotspot tracking process can be as simple as this:
# (Example logic flow)
# 1. Fetch the hottest paper pool from the last week
# 2. Quickly preview single paper content
# 3. Add social media propagation heat
# 4. Agent summarizes, filters, sorts, and generates a weekly report
If the task is to enter a new research topic, the process is just as direct:
deepxiv search "agentic memory" --limit 20 # Search topic-related papers
deepxiv paper 2506.07398 --head # View full text structure
deepxiv paper 2506.07398 --section Experiments # Intensively read key chapters
First find candidate papers, then inspect the structure, and finally read only the most critical experimental sections. If necessary, the agent can also invoke internet search to supplement general Web information, or fetch paper metadata from the Semantic Scholar database. In other words, DeepXiv does not provide isolated commands, but a set of research-task capabilities that agents can invoke continuously:
deepxiv wsearch "agent memory" # Invoke internet search
deepxiv sc 161990727 # Get Semantic Scholar metadata
If you wish to condense these capabilities further into a deliverable task, DeepXiv also ships a built-in Deep Research Agent. It chains search, filtering, progressive reading, information extraction, and summarization into a complete pipeline, sparing users from manually stitching every step together. For example, developers can directly ask it "What are the latest representative works on Agent Memory?" or "Which multimodal retrieval-augmentation papers from the past year are worth noting?" This makes DeepXiv capable not only of providing low-level commands but also of directly taking on high-level research tasks. Users can likewise package DeepXiv into Skills and inject them into any agent to quickly bootstrap agentic research work.
pip install "deepxiv-sdk[all]" # Install full tool dependencies
deepxiv agent config # Configure API key
deepxiv agent query "What are the latest papers about agent memory?" --verbose # Start deep research
Rich Access Forms: Adapting to Full-Scenario Needs from Agents to Developers
DeepXiv does not limit itself to a single-point tool but provides multiple access forms to meet multi-layer needs from agents to developers.
First, the CLI is DeepXiv's core form. Through the command line, agents can seamlessly access all capabilities, including literature search, paper retrieval, and paper utilization, and compose more complex workflows by orchestrating scripts:
deepxiv search "agent memory" --date-from 2026-03-02 --limit 50 --format json
deepxiv search "agentic memory" --date-from 2026-03-02 --limit 50 --format json
deepxiv search "memory agents long-horizon" --date-from 2026-03-02 --limit 50 --format json
Secondly, DeepXiv also provides MCP (Model Context Protocol) access, meaning you can embed DeepXiv into various agent development frameworks and make "scientific literature utilization" a standard tool for agents.
Furthermore, for developers who need deeply customized workflows, DeepXiv also provides a Python SDK, enabling flexible integration into highly customized research agents.
More importantly, on top of DeepXiv, developers can very quickly package a batch of customized Skills for specific research tasks: automatically tracking new papers in a direction every week, filtering for works with open-source code, batch-extracting experimental settings and results, generating baseline tables for a topic, or even continuously maintaining a living knowledge base for a research direction. DeepXiv is thus not just a "callable tool," but a reusable, sustainable capability base for daily research workflows.
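As a minimal sketch of such a custom Skill, the snippet below takes a list of paper records (as an agent might collect from `deepxiv search ... --format json`) and keeps only works that ship code. The exact search-result schema is an assumption here; `title` and `github_url` follow the field names in the --head example above:

```python
# Illustrative custom-Skill step: filter a paper pool down to
# open-source works and emit one Markdown bullet per paper.

def open_source_digest(results: list) -> str:
    lines = []
    for paper in results:
        if paper.get("github_url"):  # keep only papers with released code
            lines.append(f"- [{paper['title']}]({paper['github_url']})")
    return "\n".join(lines)

# Sample records mimicking parsed search output
results = [
    {"title": "MMA: Multimodal Memory Agent",
     "github_url": "https://github.com/AIGeeksGroup/MMA"},
    {"title": "Some Closed-Source Paper", "github_url": None},
]
print(open_source_digest(results))
```

Wrapped in a weekly cron job plus a `deepxiv search --date-from ...` call, this kind of filter becomes exactly the "track open-source works in a direction" Skill described above.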
Practical Demonstration: Letting Codex Organize Agent Memory Papers from the Last 30 Days
If the capabilities described above define what DeepXiv can do, what truly reflects its value is how it strings these capabilities together in a real task.
The following demo corresponds to a very typical and high-frequency research requirement:
"Help me organize papers related to agent memory from the last month, see what datasets they ran on, how effective they were, and if there is open source."
This task looks like "find a few papers and summarize," but in practice it involves a whole chain: first determine the time range (the last month), then search around the topic, handle noisy results, and preview candidate papers one by one to filter out works that match keywords but not the theme. After finding truly relevant papers, the agent continues to inspect structures and experimental chapters, extracting key information such as benchmarks, metrics, scores, and code links, and finally organizes it all into a deliverable, editable Markdown baseline table.
Without agent-oriented data and tooling, this process usually means switching back and forth between web pages, paging through PDFs, copy-pasting, and manually assembling tables. In the DeepXiv workflow, it breaks down into a set of very natural actions.
Step 1: Search Candidate Papers by Topic and Time Range
First, the agent performs multiple synonym searches around the user's topic, rather than betting on a single query:
deepxiv search "agent memory" --date-from 2026-03-02 --limit 50 --format json
deepxiv search "agentic memory" --date-from 2026-03-02 --limit 50 --format json
deepxiv search "memory agents long-horizon" --date-from 2026-03-02 --limit 50 --format json
The benefit of doing this is that the agent can recall as many candidate papers as possible first, and then narrow down the scope step-by-step in subsequent steps with lower cost.
In this step, it can quickly find highly relevant papers like AdaMem, All-Mem, D-MEM, Memex(RL), AndroTMem, LMEB, etc., while also identifying some results that just touch on keywords but do not belong to the main line of agent memory.
Step 2: Use --brief for Low-Cost Filtering
For the search results, there is no need to read the whole text at once. A more reasonable approach is to preview first:
deepxiv paper 2603.16496 --brief
deepxiv paper 2603.19595 --brief
deepxiv paper 2603.14597 --brief
deepxiv paper 2603.18429 --brief
--brief surfaces the most critical information first: title, date, TL;DR, keywords, and GitHub link. For agents, the value of this step is huge: it completes the first round of judgment at extremely low token cost, answering questions like "Is this paper actually about agent memory?", "Is it a method paper, a benchmark paper, or more of a system/governance architecture?", and "Does it have a GitHub repo worth prioritizing for further reading?"
It is also at this layer that the agent can quickly split candidate papers into primary and secondary sets, avoiding wasted budget on a pile of marginally relevant results.
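This triage step can be sketched in a few lines. The snippet below is a hypothetical illustration (not DeepXiv's own logic): it splits `--brief` previews into primary and secondary reading sets using the cheap signals just described, topic terms in the TL;DR/keywords and the presence of a GitHub link. Field names (`arxiv_id`, `tldr`, `keywords`, `github_url`) follow the example outputs shown earlier; the sample records are made up:

```python
# Hypothetical triage over --brief previews: cheap signals decide which
# papers get structure inspection and section reading next.

def triage(briefs: list, topic_terms: tuple = ("memory",)) -> tuple:
    primary, secondary = [], []
    for b in briefs:
        text = (b["tldr"] + " " + " ".join(b["keywords"])).lower()
        on_topic = any(term in text for term in topic_terms)
        if on_topic and b.get("github_url"):
            primary.append(b["arxiv_id"])    # read --head and --section next
        else:
            secondary.append(b["arxiv_id"])  # revisit only if budget remains
    return primary, secondary

# Made-up sample previews
briefs = [
    {"arxiv_id": "2603.16496", "tldr": "Adaptive agent memory ...",
     "keywords": ["agent memory"], "github_url": "https://github.com/example/adamem"},
    {"arxiv_id": "2603.18429", "tldr": "A governance framework ...",
     "keywords": ["policy"], "github_url": None},
]
print(triage(briefs))  # -> (['2603.16496'], ['2603.18429'])
```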
Step 3: Use --head to View Structure, Then Read Only Experiment-Related Sections
After filtering out truly relevant papers, the next step is not to "feed the whole text," but to look at the structure first, then read specifically:
deepxiv paper 2603.16496 --head
deepxiv paper 2603.19595 --head
deepxiv paper 2603.14597 --head
deepxiv paper 2603.18429 --head
This step mirrors how a human researcher works: first see what chapters the paper has, confirm what the experimental part is called, then expand only the sections such as Experiments, Results, and Evaluation that actually contain benchmarks and scores, and, if needed, supplement with the dataset or experimental settings in the Appendix.
For example, in this task, the agent extracted a lot of directly comparable information from the experimental chapters:
- AdaMem was evaluated on LoCoMo and PERSONAMEM, with LoCoMo reaching up to 44.65 F1, and PERSONAMEM average accuracy of 63.25%.
- AndroTMem proposed AndroTMem-Bench and compared three history representations: raw history, summary, and ASM. For instance, Gemini-3-Flash under ASM can reach AMS 59.03 / TCR 65.05.
- Memex(RL) on the modified ALFWorld improved task success rate from 24.22% to 85.61%.
- Trajectory-Informed Memory Generation on AppWorld pulled the SGC of held-out scenarios from 50.0 to 64.3.
- LMEB, as a benchmark, summarized 22 datasets and 193 zero-shot retrieval tasks.
In other words, what DeepXiv provides here is not simply "handing over paper content," but letting agents consume literature by coarse screening first, then locating structure, and finally doing targeted intensive reading.
Step 4: Automatically Generate Markdown Baseline Table
Once papers, datasets, metrics, scores, and open-source status are all extracted, the final step is to organize them into structured deliverables.
In this demo, the agent finally wrote the results into a Markdown table, including: paper title and arXiv link, whether it is open source, code address, which benchmarks/datasets were run, what metrics were used, core results and comparable scores, and brief remarks on the paper's positioning.
This step is crucial because it means DeepXiv is not serving a one-off Q&A, but a research asset that can be continued to be reused: you can directly rewrite the Markdown file into a research document, slides, weekly report, or use it as a baseline starting point for subsequent projects.
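The final assembly step can be sketched as follows. This is an illustrative snippet, not the demo's actual code; the record fields (`title`, `arxiv_id`, `code`, `bench`, `metric`, `score`) are assumed names for whatever the agent extracted, and the AdaMem row reuses the LoCoMo F1 figure quoted above:

```python
# Illustrative last step: turn extracted per-paper records into a
# Markdown baseline table ready to paste into a report.

def baseline_table(rows: list) -> str:
    header = (
        "| Paper | Open Source | Benchmark | Metric | Score |\n"
        "|---|---|---|---|---|"
    )
    body = [
        "| [{title}](https://arxiv.org/abs/{arxiv_id}) | {oss} | {bench} | {metric} | {score} |".format(
            oss="yes" if r["code"] else "no", **r
        )
        for r in rows
    ]
    return "\n".join([header] + body)

# One extracted record (arXiv ID pairing is illustrative)
rows = [
    {"title": "AdaMem", "arxiv_id": "2603.16496", "code": True,
     "bench": "LoCoMo", "metric": "F1", "score": "44.65"},
]
print(baseline_table(rows))
```

Because the output is plain Markdown, the same rows can feed a weekly report, slides, or a growing baseline file with no further conversion.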
This skill ships with the project and can be used directly! For example, copy it into the ~/.codex/skills/ directory to invoke it from Codex.
What This Demo Really Shows
What makes this example truly interesting is that it is not a showpiece task, but an everyday, very real research action.
For researchers, "what has come out in this direction in the last month, which datasets were run, how good were the results, and is there open-source code" is inherently a high-frequency demand. DeepXiv completes this task, for the first time, in a way that truly fits the agent workflow: search is structured, requiring no web parsing; preview is low-cost, requiring no full-text reading; reading is progressive, expanding only the key chapters; extraction targets tables and downstream tasks rather than stopping at natural-language summaries; and the final output is savable, reusable, and extensible, becoming an intermediate product of the research process.
This is precisely the core problem DeepXiv wants to solve: not merely "moving papers to the command line," but truly turning papers into first-class objects that agents can call, filter, read, analyze, and deliver.
If traditional paper websites serve "humans clicking pages and reading themselves," then DeepXiv serves "agents actively invoking literature capabilities around research tasks and completing delivery."