Zero Index, Zero Embedding, Pure Grep: DCI Does Deep Research Directly on Raw Corpora

Recently browsing arXiv, I stumbled upon a rather interesting paper: "Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction," authored by researchers from Texas A&M, Waterloo, UCSD, Stanford, UIUC, along with companies like Verdent AI and Lambda.

The title pretty much gives away the core idea—stop obsessing over retrievers, it's time to rethink the very act of searching. After reading it, I found it quite inspiring. Let's break this paper down piece by piece.

Paper link: https://arxiv.org/pdf/2605.05242

Figure 1: Pareto frontier of performance vs. cost on BrowseComp-Plus. The blue star is the Qwen3-Embed-8B retriever approach; the green stars are the two DCI-Agent variants from this paper. Notice in the upper-right high-value region, DCI's approach crushes traditional retrievers—11 points higher accuracy, while saving $424 in costs.

McLuhan's Old Adage Gets Dusted Off Again

The paper opens by quoting Marshall McLuhan: "The medium shapes and controls the scale and form of human association and action."

Translated into the context of agent-based retrieval, this becomes: How an agent "sees" a corpus determines what it can and cannot do.

We're all familiar with the traditional RAG pipeline—chunk documents, build an index, run BM25 or dense embedding, toss in a query, and fetch the top-k results. This paradigm works fine for static large corpora and single-turn QA, and it's efficient enough. But the authors argue, where's the problem? The problem is that agents are getting dramatically more capable.

In deep research tasks within new benchmarks like Search-R1, ASearcher, and BrowseComp-Plus, agents can already independently plan, rewrite queries, and conduct multi-turn searches. But no matter how much they tinker, they're always separated from the corpus by a retriever layer, only seeing compressed top-k snippets at each step. This is awkward—the agent has ideas, but its "eyes" are severely restricted.

Consider several scenarios where retrievers have always struggled: What if you need an exact string match? What if you need to combine multiple weak signals into a composite query? What if you discover a new lead and want to immediately verify it within the original text? These tasks are awkward to perform through a traditional retrieval interface; much useful evidence gets filtered out before reaching the top-k, and no amount of downstream reasoning can salvage it.

So, Why Not Just Let the Agent Grep?

The authors' approach is elegantly brute-force—since the retriever is the bottleneck, let's just remove that layer entirely. Let the agent directly use universal terminal tools like bash, grep, find, and cat, diving straight into the raw corpus.

This approach is named Direct Corpus Interaction (DCI).

Figure 2: Left is the traditional retriever-mediated model: an index is built offline, the agent submits queries and gets top-k results. Right is the DCI model: the agent directly uses commands like grep, glob, and bash on the raw corpus—no embedding model, no vector index, nothing, just a shell.

Sounds counterintuitive, right? It's 2026, and we're back to playing with grep? But if you think about it, it actually makes plenty of sense:

Coding agents over the past couple of years have become incredibly proficient with CLI-based operations—projects like SWE-agent, Agentless, Claude Code, OpenHands, and Aider have all proven one thing: the combo of grep + read + bash is sufficient for a reasonably strong model to precisely locate, edit code, and run tests within a repository. If it can do that for code, why not for document search?

The benefits of the DCI paradigm are also quite direct:

No offline indexing overhead—dump a corpus in and you can start searching immediately.
Natively adapts to dynamic corpora—changes to a file are instantly searchable without rebuilding the index.
High interface resolution—the agent can apply stacked constraints like grep 'foo' file | grep 'bar', or use grep -n 'keyword' file | head to see exact locations and context.
Semantic understanding shifts from the index to the LLM itself—improvements in the model directly translate to gains.

Two Agent Implementations: One Lightweight, One Fully Armed

For controlled experiments, the authors built two DCI implementations:

DCI-Agent-Lite is a minimalist version, adapted from a lightweight terminal coding agent called Pi, equipped with only two tools: bash and read. The base model is GPT-5.4 nano with reasoning effort maxed out to high. This version's goal is to strictly isolate the "interface change" variable—no retrieval-specific modules, no embeddings, no reranker, purely a shell.

DCI-Agent-CC takes Claude Code directly as the harness, swapping in Claude Sonnet 4.6 as the base model with medium reasoning. This version aims to probe the performance ceiling—adding stronger prompts, more robust tool orchestration, and built-in context management. Note, however, it is still fundamentally DCI, touching no retriever interface whatsoever.

Both agents are given a max turn budget of 300, allowing ample room for exploration.

How to Avoid Context Explosion on Long Trajectories?

An unavoidable question for DCI is—one grep search could yield hundreds or thousands of matches, one cat command on a file could be tens of thousands of tokens, and after dozens of search rounds, the context window would explode.

Figure 3: Three runtime context management strategies—truncation, compaction, summarization.

The authors equipped DCI-Agent-Lite with a lightweight set of runtime context management, stacking three mechanisms:

Truncation is the simplest—if any tool call's output exceeds a certain character limit, it's chopped, but a trace that "this call occurred" is retained.

Compaction requires no LLM calls; it's a pure in-memory operation. When accumulated tool output surpasses a threshold, the tool results of older turns are replaced with short placeholders, but the structural skeleton of the tool call themselves remains.

Summarization is the heavy intervention—when context pressure is still too high, a summarization agent condenses the history into a brief summary, while the most recent turns are preserved in their original form.

These three mechanisms combine into five strategy levels, from L0 to L4, ranging from none to fully engaged. L0 does nothing; L1 only truncates to 50K characters; L2 truncates to 20K; L3 adds compaction; L4 adds summarization on top.

Evaluate Beyond Accuracy: Coverage and Localization

Looking at answer accuracy alone can't fully articulate the difference between DCI and traditional retrieval. So the authors introduced two trajectory-level metrics.

Coverage: Does the trajectory surface the gold documents? Measured via three scopes—any (at least one found), mean (average number found, i.e., recall), all (all found). This is a "breadth" metric.

Localization: After surfacing gold documents, can it precisely locate the key evidence snippets within the document? This is a "depth" metric, reflecting the ability to "drill down after reaching the key document."

Put simply, coverage measures "did you reach it," and localization measures "can you refine once reached." Combined, these two metrics bring the differences between DCI and retrievers into sharp relief.

Experimental Results: Pummeling the Retriever

The authors tested three categories of benchmarks: BrowseComp-Plus (agentic search), 6 multi-hop QA datasets (NQ/Trivia/Bamboogle/HotpotQA/2Wiki/MuSiQue), and 6 IR ranking datasets (4 BRIGHT + 2 BEIR).

On Agentic Search, using Claude Sonnet 4.6 as the base model, swapping out the Qwen3-Embedding-8B retriever for DCI boosted accuracy from 69.0% to 80.0% (+11 points) while reducing cost from $1,440 to $1,016 (-29.4%). DCI-Agent-CC also directly surpassed all retrieval baselines, including the strongest GPT-5 + Qwen3-Embedding-8B combo (71.7%), by 8.3 points. The lightweight DCI-Agent-Lite achieved 62.9% accuracy for just $93—trading blows with o3 + Qwen3-Embedding-8B (66%) but saving $647 in costs.

On Multi-hop QA, DCI-Agent-CC averaged 83.0% accuracy, a full 30.7 points higher than the strongest retrieval agent baseline, ASearcher-Local-14B (52.3%). The most staggering case was on the hardest dataset, MuSiQue, where DCI-Agent-CC scored 74% against ASearcher's 24%—a 50-point gap. HotpotQA saw a 30-point lead, and 2Wiki saw a 26-point lead. The lightweight DCI-Agent-Lite also achieved 68%, a solid second place.

On IR Ranking, things were even more dramatic—one would assume this is a retriever's home turf. Yet DCI-Agent-CC claimed victory on all 6 datasets, averaging an NDCG@10 of 68.5, a whopping 21.5 points higher than the strongest competitor, ReasonRank-32B (47.0). The Lite version scored 56.7, also 9.7 points above ReasonRank-32B.

Where Exactly Does DCI Win?

This is the paper's most fascinating section—the authors devoted significant space to controlled variables to pinpoint exactly where the gains originate.

Performance breakdown and tool distribution

Figure 4: Left is the comparison of DCI-Agent-CC vs. traditional retriever on all 830 questions in BrowseComp-Plus. Right is the distribution of DCI's tool calls—Bash accounts for 62.4%, Grep for 33%, and the rest are miscellaneous. Bash calls are further broken down by intent: chain search, document peek, regex, etc.

Let's first examine the counterintuitive finding from Research Question 1 (RQ1): DCI doesn't win because it retrieves more gold documents.

On a 100-question subset, Qwen3-Embedding-8B's mean coverage was 56.7%, while DCI-Agent-Lite achieved only 28.0%. Yet coverage_any (at least one gold doc found) was nearly tied (74.0 vs 70.0), while the localization score for DCI was 48.4 against the retriever's 21.7—a difference of 26.7 points.

This tells the story. Most BrowseComp-Plus questions have only 1-4 gold documents. Once DCI latches onto a useful document, it can immediately switch gears: from casting a wide net to drilling deep locally. It abandons the obsession of reeling back the entire gold chain, focusing instead on "can I extract more value from the document I've already reached."

The authors dubbed this phenomenon retrieval interface resolution—the retriever provides "document-level" or "paragraph-level" resolution, whereas DCI can provide "character-level" resolution. The agent can precisely lock onto a line, a sentence, or even the context around a token, then initiate the next search based on that.

The tool usage distribution corroborates this. Among Bash commands, chain search (grep piping into grep) accounts for 22.3%, document peek (head/tail/sed for local views) for 18%, regex search for 17%, single-keyword grep for 14.1%, and file finding for 14%—all are "refinement" type operations. Complete cat of a file only constitutes 9.2%. The agent's core loop is: combining constraints, precise matching, snippet verification, and reading on demand.

Not All Scenarios Benefit Equally

The authors were also honest, conducting several stress tests to delineate boundaries.

RQ4 – The Corpus Size Hurdle: The authors scaled BrowseComp-Plus's corpus from 100K documents to 200K (injecting distractor documents from FineWeb), and then to 400K.

Figure 5: DCI-Agent-CC performance across different corpus sizes—optimal value is at 100K; at 200K, tool calls jump from 38.5 to 86.9, accuracy drops 13.6 points; at 400K, it falls apart, accuracy plummets to 37.5%, averaging 122.4 tool calls per task, with 20 questions hitting the budget limit.

This conclusion is critical—DCI scales well in search depth but incurs steep costs for search breadth. Once the agent finds a good "anchor document," subsequent operations are highly efficient; but the cost of finding that first anchor skyrockets as the candidate space expands. Thus, traditional dense/sparse retrieval still holds irreplaceable value for massive static corpora.

RQ5 – How Much Does Context Management Strategy Matter? They ran comparisons across the L0 to L4 strategies. The conclusion is somewhat counterintuitive—more aggressive management isn't necessarily better, exhibiting a distinctly non-monotonic curve. L1 is fastest and retains the most gold evidence (31.3), but L3 achieves the highest accuracy (77). L2 has the lowest cost but the worst accuracy (69), and adding summarization in L4 actually caused a performance dip.

What does this imply? You have to forget the right things. Complete retention of all evidence ≠ retention of good working state. Multi-step hypothesis revision requires "selective forgetting." Too weak a compression causes the agent to drift; too harsh a compression destroys useful intermediate structures. Finding the sweet spot is key.

RQ6 – How Much Does Tool Expressiveness Contribute? This ablation study is brutal. The authors stripped DCI-Agent-Lite's tools down to just read + grep (denying even bash pipes) to see if it could still compete.

The result is—it holds its own. read + grep scored 61% accuracy, still 16 points higher than the Qwen3-Embedding-8B retriever (45%), with a similar number of tool calls. The full bash toolkit added another 12 points, but at the cost of doubling tool usage, latency, and compute.

Therefore, the core gain comes from the interface change itself, not the sophistication of bash. A minimal toolset can leverage the majority of the improvement.

Snooping Through the Code: A Thin Python Layer Wrapped Around Pi

The RQ6 finding that "a minimal toolset suffices" prompts the question: what does that minimal toolset look like in practice? Reading the paper wasn't enough, so I cloned the repository (github.com/DCI-Agent/DCI-Agent-Lite) and dug in. It turns out the engineering implementation is much "lighter" than expected—87% Python, 13% Shell, and a clean directory structure.

Overall Architecture: DCI-Agent-Lite ≈ Pi + Context Management Patch + Eval Scaffolding

First, a counterintuitive point—DCI-Agent-Lite has very little custom Python code; it's fundamentally a glue layer. The actual agent kernel is Pi, a minimalist terminal coding agent developed by the Earendil team (written in TypeScript, installed via npm).

The repository directory structure roughly looks like this:

DCI-Agent-Lite/
├── src/dci/             # Python CLI wrapper layer (dci-agent-lite entry point)
├── prompts/             # Task templates and evaluation prompts
├── scripts/             # Data download + benchmark run scripts
├── setup.sh             # One-click environment setup
└── pyproject.toml       # uv manages Python dependencies

Note that a step within setup.sh clones the codex/context-management-ablation branch from jdf-prog/pi-mono (a fork by first author Dongfu Jiang), then runs npm run build. In other words, the authors forked Pi and patched it for "context management ablation"—the L0 to L4 strategies in the paper are precisely what this patch implements. The Python code on the Lite side primarily handles: parameter passing, spawning the Pi process, collecting output, and saving trajectories.

Why Choose Pi as the Base?

Pi's design philosophy is a match made in heaven for DCI—Pi's website slogan, "There are many agent harnesses, but this one is yours," says it all. Let's look at a few key attributes:

Minimalist system prompt. Pi's default system prompt is shockingly short; none of that "You are a helpful assistant" boilerplate, leaving plenty of token room. This is crucial for long-trajectory deep research scenarios.

Few native tools, bash is core. Unlike Claude Code, which comes loaded with Read/Grep/Glob/Edit/Task tools, Pi only gives you a bash and a read. The rest is on you to cobble together shell commands. This perfectly aligns with what the paper's RQ6 wanted to verify—a minimal toolset is sufficient.

Built-in compaction mechanism. Pi by default auto-summarizes older messages into short text when approaching context limits. The authors extended this mechanism in their fork into the tunable L0-L4 strategies.

No frills. No sub-agents, no plan mode, no MCP—these are all things Pi deliberately "cuts," relying on extensions to add them. This stripped-down style made it easy for the authors to run controlled experiments: ensuring performance differences weren't noise introduced by the harness.

What Does the Actual Workflow Look Like?

The minimal runnable command looks like this:

uv run dci-agent-lite \
  --provider openai \
  --model gpt-5.4-nano \
  --cwd "corpus/wiki_corpus" \
  --extra-arg="--thinking high" \
  --extra-arg="--context-management-level level3" \
  "Using the wiki_dump.jsonl in the current directory, answer: Which street did the Great Fire of London originate on? Use rg instead of grep."

Let's break down the key parameters:

--cwd "corpus/wiki_corpus" is crucial—the agent's working directory IS the corpus directory. When Pi launches, bash's pwd points directly to the folder containing wiki_dump.jsonl. All subsequent rg, find, cat operations run right here, essentially making the model a "resident" of the corpus. This design perfectly matches the paper's phrasing: "agent operates directly within the environment it is reasoning over."

--extra-arg="--thinking high" passes through to Pi, then to OpenAI's reasoning effort. The combination of cheap GPT-5.4 nano with high thinking effort is the key recipe for this lite version's performance.

--extra-arg="--context-management-level level3" is the switch for the five strategy tiers from Table 1 in the paper; level3 is the default (truncation + compaction, no summarization).

That instruction in the user prompt—"Use rg instead of grep for fast searching"—seems like boilerplate, but it's an engineering detail. rg (ripgrep) is an order of magnitude faster than grep, practically a necessity for corpora with millions of documents. setup.sh also specifically installs ripgrep.

How Are Trajectories Saved?

After a run, the agent serializes the entire search process into outputs/runs/<timestamp>/:

question.txt — The original question.
final.txt — The final answer.
conversation_full.json — Complete dialogue history, containing every turn's thought, bash command, and tool result.

This JSON is the raw material for the paper's trajectory analysis—the tool distribution chart in RQ2 (grep 33%, bash chain search 22.3%, etc.) is parsed directly from this kind of JSON file. This also explains how the paper could dissect agent behavior at such a fine granularity.

How is the Context Management Layer Actually Implemented?

This is the most interesting engineering aspect. The paper diagrams three mechanisms (truncation/compaction/summarization), but how specifically do they hook into an agent loop like Pi's? Based on Pi's extension mechanism (extensions can "inject messages before each turn, filter message history") and the authors' patch, we can roughly infer:

Truncation happens before the tool result is written into the message history—once bash finishes, the raw output could be tens of thousands of characters. The truncation logic, based on the level's threshold (L1 is 50K, L2-L4 are 20K), chops off the excess; the model sees the truncated version. This has the biggest impact on token consumption.

Compaction is an "in-memory, zero-LLM operation"—purely structural, no model calls. Once the accumulated tool output exceeds 240K characters (L3 threshold), old tool results, except for the most recent 12 turns, are replaced with <Result_Placeholder>, but the structure of the tool call itself ("used grep xxx") is preserved. This way, the agent knows what it did without carrying the original, already-digested text.

Summarization is a heavy operation enabled only at L4—after compaction, if estimated tokens still exceed the threshold, an LLM is called once to compress the compacted history into a brief summary; the most recent 20K tokens are kept in their original form. The paper also notes a detail: if summarization fails three times consecutively, it gives up to prevent an infinite loop.

These three layers, stacked from light to heavy and combined with the non-monotonic finding in RQ5 (L3 optimal, L1/L4 next, L2 worst), essentially clarify one thing—what to forget, when to forget, and how to forget is itself an engineering problem, unsolvable by naive intuition like "the more aggressive the compression, the better."

Why Can This Thing Achieve 62.9%?

Piecing the above together, DCI-Agent-Lite manages to beat a GPT-5.2 + retriever setup on BrowseComp-Plus using the cheap GPT-5.4 nano. I believe this results from a combination of factors:

One, Pi's system prompt is extremely concise, wasting no tokens on "self-introduction." Two, the working directory points directly at the corpus; the agent doesn't need to spend effort learning a custom retrieval API—it already knows bash. Three, context management reins in the explosion issue on long trajectories, making it possible to not blow through the 300-turn budget. Four, GPT-5.4 nano's high reasoning effort plus ripgrep's speed makes the "search-and-think" loop runnable.

One final point that particularly impressed me is—this thing can genuinely be run directly on the documents on your own computer. The repository's README line, "Your private deep-research assistant," isn't marketing fluff—you don't need to upload documents to any cloud service, install a vector DB, or wait hours for embedding jobs. uv run dci-agent-lite --cwd ~/my-papers/ and you're off. This "out-of-the-box" appeal is incredibly strong for personal knowledge base scenarios.

My Takeaways

After reading the paper and going through the code, I feel its true value isn't any specific number, but reframing "retrieval" as an interface design problem, not just a retriever design problem.

The past decade of retrieval research has largely been a model arms race—better sparse algorithms, stronger dense embeddings, smarter rerankers. But if the agent itself can already think like a researcher—propose hypotheses, verify strings, read context, modify queries—then compressing everything into a similarity vector layer becomes the bottleneck. Giving it a higher-resolution interface and letting it handle semantics itself might yield better overall performance.

Of course, this paradigm has clear boundaries:

Massive static corpora (tens of millions of documents and up) are still dense retriever territory.
It places high demands on the base model's capability; weaker models can't withstand the complex search of long trajectories.
The cost structure shifts—from a "one-time fixed index-building cost" to "marginal cost per search," which may not be cost-effective for high-QPS services.
Evaluation metrics need an update—recall@k alone is no longer sufficient; trajectory-level localization is needed.

But conversely, in local, heterogeneous, dynamically changing agentic workspaces—like a developer's local codebase, continuously updated internal enterprise documents, scattered PDFs on a researcher's computer—the DCI paradigm genuinely has a competitive edge. No index to build, files are instantly searchable once changed, and the agent operates directly within the environment it's reasoning about. The experience will feel incredibly smooth.

Furthermore, looking at Anthropic's push this past year on GitHub with Claude Code and agent skills, the entire agent toolchain is moving in this direction—placing the LLM directly into the shell, into the filesystem, into real environments. From this perspective, DCI seems almost like an inevitable outcome.

A few directions I think are worth following next: how to blend DCI with traditional retrieval (e.g., rough filtering via dense recall, then grep for refinement); how to add a caching layer to DCI to reduce repeated search costs; and whether lighter base models (the 7B/14B class) can also reap the benefits of this paradigm.

The repo currently has 9 stars and 1 fork. The paper was just posted on arXiv last week (2605.05242), and the entire project is still very nascent. But I think this line of thinking is worth tracking—its engineering barrier is so low that almost anyone can reproduce it, run it on their own PDFs, their Notion exports, and see the results for themselves. Far more fun than just reading the paper.