The claim that 'RAG is dead' has been rampant this year, with titles like 'Long-Context Windows, Agent Rise, Is RAG Dead?' and 'The RAG Obituary: Killed by Agents.' New-generation Agent CLIs like Claude Code and Codex have also abandoned embedding, with officials directly admitting: no indexing, no vector libraries—LLM-driven Grep is sufficient. Is RAG truly unsuitable for today's agents? We conducted in-depth research on this question, and simultaneously decompiled the source code of cutting-edge solutions like Claude Code, ultimately forming this article to answer the core concern of whether RAG still has a place in the Agent era.
01
Claude Code's creator, Boris Cherny, has mentioned this surprising fact on multiple public occasions: Claude Code does not use RAG, embedding, or build indexes. Core search relies on LLM-driven Grep (Grep is a Unix text search tool that searches files line by line given a regular expression).
On X/Twitter, he was very direct:
"Early versions of Claude Code used RAG + a local vector db, but we found pretty quickly that agentic search generally works better."
He further elaborated in an interview with Pragmatic Engineer:
"Plain glob and grep, driven by the model, beat everything."
Anthropic's official Context Engineering blog also confirmed this architecture: Claude Code uses Grep and Glob to dynamically load code into context. This choice wasn't arbitrary. Boris mentioned in the Pragmatic Engineer interview that he observed at Meta that after the IDE's click-to-definition feature crashed, all engineers retreated to manually Grep code. However, he also admitted on the Latent Space podcast that the decision to abandon RAG was partially based on intuition. Despite Anthropic stating they don't use RAG, opting for agentic search + Grep, details on how Grep is specifically called, how the LLM decides what to search for, and what the tool-calling loop looks like were not publicly disclosed.
In March 2026, a leaked source code snapshot of the Claude Code CLI was made public. I took a look and found that, as Boris stated, there was indeed no implementation related to embedding, vector, or similarity search in the source code. But more interestingly, I observed the specific implementation methods of this zero-index content search mechanism.
This article, based on this source code and some industry practices, dissects Claude Code's code search mechanism and the design philosophy behind it from an implementation level. This article first dismantles how Claude Code drives a multi-turn Grep loop (Chapter 2), then examines why brute-force search is fast enough on local projects (Chapter 3), and finally compares it with industry solutions like Cursor to discuss the real costs and benefits of the Grep approach (Chapter 4). To keep the entire mechanism from staying abstract, we'll use an example throughout the article. Besides the terminal CLI mode, Claude Code also has web and desktop versions, which operate through a remote control system called 'bridge,' with a server-side session executing the actual work. Suppose you develop a curiosity while reading this source code: When the LLM calls GrepTool to search, how does the bridge track and record this tool call? You ask Claude Code to find the answer from the source code—what happens next?
02
Claude Code's code search can be summarized in one sentence: The LLM autonomously decides what to search for, which tools to use, and whether to continue searching after retrieving results, cycling until the information is sufficient. There's no preset search flow or fixed tool invocation order; everything is decided by the LLM at runtime. This chapter first explains how the search loop operates and which tools are available, then delves into the core GrepTool to see how it controls the amount of information returned. Finally, using the practical problem posed at the start, we'll walk through a complete multi-turn search process.
2.1 Search Loop and Tools
The core of the entire search mechanism is a loop: pass user input and the available tool list to the LLM; the LLM returns text or a tool call request. For a tool call, execute the tool and append the result to the conversation history, then call the LLM again with the updated full history. The LLM decides the next step based on the continuously growing context until it deems the information sufficient and directly generates a text answer without further tool calls, naturally ending the loop. The loop also has forced exit mechanisms: reaching the maximum turn limit, exceeding budget constraints, user interruption, or tool calls denied by permissions.
This loop treats all tools equally. The LLM can call any tool at any time, even multiple tools simultaneously in a single response. There is no hard-coded 'must search before reading.' There are four core tools related to code search:
Tool | Underlying Implementation | Function |
|---|---|---|
GrepTool | ripgrep ( | Regular expression search of file contents |
GlobTool | glob pattern matching | Find files by filename/path pattern |
FileReadTool | Node.js fs | Read a specified line range from a given file |
AgentTool | Independent LLM conversation | Launch a sub-agent for multi-step exploration |
Additionally, there are LSP (Language Server Protocol) tools that supplement Grep with semantically precise operations like "go to definition" and "find references." However, the core search architecture is built upon these four tools.
Among them, AgentTool is quite unique: it doesn't directly search files but launches an independent sub-agent, allowing the sub-agent to complete an entire search task within its own context window, only returning the conclusion to the main conversation. Sub-agents come in several types, and the most relevant to search is the Explore type: it is equipped only with search and read tools (Grep, Glob, Read), cannot edit files, execute commands, or nest new Agents—making it a pure read-only search expert.
The core value of a sub-agent is context isolation. It builds its own conversation history from scratch, not inheriting messages from the main conversation. This means the large volume of grep results and code snippets generated during its search process remain within its own context, and the main conversation only receives a summarized text conclusion. For tasks requiring broad searches, directly searching in the main conversation could fill its context with intermediate results after a few rounds of grep/read. By delegating to a sub-agent, the main conversation's context only increases by one conclusion message.
2.2 Information Volume Control in GrepTool
In general practice, the most common LLM pattern is 'locate first, then dive deep': use Grep/Glob to find relevant files, then use Read to view specific content. But after Grep finds a file, is it always necessary to follow up with a Read to make use of it? Not necessarily. The key is that GrepTool has three output modes, returning completely different amounts of information:
files_with_matches mode (default): Only returns a list of matching file paths, no code content. For example, searching for "class.*Transport" might return paths like
cli/transports/WebSocketTransport.ts,cli/transports/SSETransport.ts. The LLM only gets filenames, so this mode typically requires a subsequent Read to see the specific code. This is why the default mode is designed to return only filenames—deliberately controlling information volume to avoid flooding the context window with large amounts of code in one Grep, letting the LLM judge which files are worth deeper reading. There's also a protective mechanism:head_limitdefaults to 250, meaning even if 10,000 matches are found, only the first 250 are returned, preventing search results from overwhelming the context.content mode: Returns matching lines and their contextual code. For instance,
Grep({pattern: "TOOL_VERBS", output_mode: "content", "-C": 5})would directly return code snippets with 5 lines of context before and after each match. For many scenarios—like confirming a constant's value, viewing a function signature, or checking if an import exists—these snippets suffice, eliminating the need to Read the entire file.count mode: Only returns the number of matches per file, used to quickly assess the distribution density of a search term across a project, without returning specific content.
So, the actual tool combination is flexible: Grep (default mode) → Read is the most common path, but Grep (content mode) can be used independently. The LLM can also directly call Read (if it already knows the file path) or initiate multiple parallel Grep searches simultaneously. This flexibility is intentional. The design philosophy uses soft guidance instead of hard constraints: the system prompt suggests the LLM locate with Grep first and then dive deep with Read, and GrepTool's default output mode naturally guides this flow, but other paths aren't blocked in code, allowing the LLM to make judgments based on specific situations.
2.3 Practical: Tracking the Execution Records of the Search Tool
Let's return to the example from the opening. Of course, I deliberately chose this problem because the answer is scattered across multiple files, requiring a multi-turn search to piece together the full picture, aiming to demonstrate the complete multi-turn search process. I posed this question to Claude Code, and the following is the actual search process.
Round 1: Casting a wide net. The LLM translates the problem keywords GrepTool and tracking into grep patterns, sweeping through with the default files_with_matches mode:
Grep({pattern: "GrepTool|tool.*track|tool.*activity", glob: "*.ts"})→ Returns 4 files: structuredIO.ts, sessionRunner.ts, bridgeUI.ts, bridgeStatusUtil.ts
Four files, three under the bridge/ directory and one under cli/. As the question pertains to the bridge system, the LLM focuses on files under bridge/. sessionRunner.ts (session + runner) is most likely to contain tool execution tracking logic.
Round 2: Viewing Context. Grep switches to content mode to see the context of GrepTool within sessionRunner.ts:
Grep({pattern: "GrepTool|tool.*activity", path: "bridge/sessionRunner.ts", output_mode: "content", "-C": 5})
The returned code snippet reveals the tail end of a mapping table, showing entries like GrepTool: 'Searching', BashTool: 'Running', but the leading lines are truncated. The LLM judges that it needs to Read the entire code block to see the full table.
Round 3: Calling Read. Using Read to open the full context of sessionRunner.ts, revealing three key structures in one go:
The first is a tool-name-to-verb mapping table (
TOOL_VERBS, with 18 entries total; search-related ones are listed here). Each search tool (Grep, Glob) is mapped to 'Searching'. Note the dual naming conventions, like the internal name `Grep` and the external SDK name `GrepTool`. This indicates that tool names are hardcoded in the mapping table, not dynamically registered.Grep: 'Searching', GrepTool: 'Searching', Glob: 'Searching', GlobTool: 'Searching', Read: 'Reading', FileReadTool: 'Reading', Edit: 'Editing', FileEditTool: 'Editing', Bash: 'Running', BashTool: 'Running', // Also includes Write, MultiEdit, WebFetch, WebSearch, Task, NotebookEditTool, LSP, etc.The second is a summary generation function that concatenates a verb and the search target: the verb comes from the mapping table above, while the target is extracted from the tool call's input (prioritizing file_path, then pattern, command, url, etc.). So, a summary for a call like
GrepTool({pattern: "reconnect|backoff"})would be 'Searching reconnect|backoff'.The third is an activity parser: it parses JSON line by line from the session's stdout. When it discovers a tool call event, it calls the summary function mentioned above to generate a summary and packages it into an activity event.
At this point, we know how tracking and recording works, but where does the generated activity event go next?
Round 4: Tracking consumers. Grep searches for references to SessionActivity, tracing out the entire chain at once:
Grep({pattern: "SessionActivity|currentActivity", path: "bridge/", output_mode: "content", "-C": 2})
Three files surface simultaneously:
bridge/types.ts: Type definitions for activity events, containing only 3 fields (type, summary, timestamp). Each session maintains a ring buffer and a current activity pointer.bridge/bridgeMain.ts: A timer periodically polls each session's current activity and maintains a trail of the last 5 tool calls, e.g., a history likeSearching → Reading → Searching → Editing.bridge/bridgeUI.ts: Upon receiving a tool start event, it caches the summary text and renders it into the bridge's status panel.
This completes the full tracking chain: the session process outputs JSON for tool calls → the activity parser extracts and generates a summary → the bridge main process periodically polls to get the latest activity → the UI module renders it to the status panel.
03
The previous chapter demonstrated how Claude Code uses multi-turn Grep to search code. But this raises an obvious question: every round of Grep is a brute-force scan across project files. Wouldn't that be slow for a larger project with tens of thousands of files?
Grep today is a large family. The classic GNU grep, born in 1973, recursively iterates files, doesn't recognize .gitignore, and is single-threaded by default. But GrepTool under the hood of Claude Code doesn't use it. Instead, it uses ripgrep, a modern implementation rewritten in Rust by Andrew Gallant in 2016. It respects .gitignore by default, automatically skips binary files, runs multi-threaded, and accelerates matching with SIMD—designed from the ground up for fast searching in large codebases.
Source code evidence: in tools/GrepTool/GrepTool.ts:21, there's import { ripGrep } from '../../utils/ripgrep.js', so the real workhorse is ripgrep, not the system grep.
This chapter explains why ripgrep's brute-force scan is fast enough for a developer's local projects: how five layers of filtering shrink the search scope from tens of thousands of files to just a few dozen, how SIMD and Boyer-Moore accelerate in-file matching, and the fundamental difference in data scale between code search and vector retrieval. Section 3.4 also includes actual test data comparing ripgrep vs. GNU grep using Claude Code's own source code, so the difference can be seen directly.
3.1 Five Layers of Filtering in ripgrep
ripgrep doesn't perform regular expression matching on every file. Before actually searching content, it applies multiple layers of filtering to progressively narrow the scope:
Layer 1: Directory-level pruning (.gitignore) — skips entire subtree directories, without even reading directory contents
Layer 2: Path scope limitation (path parameter) — limits the starting point for directory traversal
Layer 3: File type filtering (glob parameter) — traverses directory but skips non-matching files
Layer 4: Binary file detection — reads the first few bytes of a file header, skips if binary
Layer 5: Content search (regex matching) — finally performs matching on files that passed all filters
The effect of each layer multiplies. Let's revisit our running example from the article, the 4th round search: Grep({pattern: "SessionActivity|currentActivity", path: "bridge/", glob: "*.ts"}). On the leaked Claude Code source code (4,471 files), the actual filter chain was:
Original file count: 4,471
Layer 1 .gitignore pruning: → 4,471 (source code snapshot has no node_modules, this layer had no effect)
Layer 2 path restriction bridge/: → 32 (only traverses the bridge/ directory)
Layer 3 glob *.ts filtering: → 32 (all files under bridge/ are .ts, this layer had no extra filtering)
Layer 4 binary detection: → 32 (all are text files)
Layer 5 regex matching: → 3 files matched (bridgeStatusUtil.ts, sessionRunner.ts, bridgeUI.ts)
In this example, the path restriction was the biggest filter, slashing the number from 4,471 to 32 in one step. But for a full Node.js project including node_modules/, .gitignore pruning would have an even larger effect. A typical node_modules/ directory can contain tens of thousands of files, and a single rule can cut a massive number of them.
3.2 Acceleration Methods for In-File Searching
For the files that actually need searching after filtering, ripgrep employs multiple optimizations at the content matching level:
SIMD vectorized matching. Under the hood, ripgrep uses Rust's regex crate, which leverages CPU SIMD instructions to compare bytes in parallel. Ordinary byte-by-byte comparison processes 1 byte at a time, while AVX2 processes 32 bytes at once. During a search, SIMD is first used to rapidly scan for the occurrence of the search string's first character, only performing a full match on a hit. For multi-pattern searches, ripgrep utilizes the Teddy algorithm to achieve SIMD-level multi-pattern parallel matching.
Boyer-Moore hopping. For fixed-string searches, comparison starts from the end of the pattern. When a mismatch occurs, it directly skips multiple characters based on a bad character table. For long patterns, only about n/m characters need to be scanned (n = file size, m = pattern length).
Operating System Page Cache. Content of read files is cached in memory by the OS. For frequently used code projects, files are almost always in the cache. The first search might trigger disk I/O, but the second search returns directly from memory.
mmap zero-copy. For large files, ripgrep uses mmap (memory mapping) instead of the usual read() system call. Ordinary read() requires copying data from kernel space to user space; mmap allows the process to directly access the kernel's Page Cache, saving one data copy. For small files, it's not worth it due to system call overhead, so ripgrep dynamically chooses based on file size.
Multi-threaded parallelism. ripgrep uses a thread pool to process multiple files in parallel: one thread traverses the directory tree to produce file paths, multiple worker threads search different files in parallel, and results are aggregated via a lock-free queue.
3.3 Performance Benchmark Data
Using Claude Code's own source code (4,500 files, 950,000 lines of code) for a real-world test, comparing the time taken by ripgrep and GNU grep to search for the same keyword on the same machine (stable values from 3 runs):
Search Pattern | ripgrep | GNU grep -r | Multiplier |
|---|---|---|---|
| 0.09s | 2.55s | 28x |
| 0.10s | 3.30s | 33x |
| 0.10s | 2.45s | 25x |
The file scopes searched by both are nearly identical (ripgrep 4,494 files vs. GNU grep 4,522). The gap mainly comes from ripgrep's multi-threading and SIMD acceleration, not from file filtering. A search latency of 0.1 seconds is virtually imperceptible for interactive use.
3.4 Finite Data Scale
A very important reason is that the data scale Claude Code confronts (a developer's local project) falls squarely within the feasible range for brute-force search.
Vector Retrieval | Grep | |
|---|---|---|
Typical Data Volume | GB ~ | MB ~ hundreds of MB |
Single Comparison Cost | 768 float multiplications (cosine similarity) | 1 byte equality check |
After SIMD Acceleration | ~24 multiplications/instruction | ~32 comparisons/instruction |
Total Brute-Force Scan Time | seconds ~ minutes | tens of milliseconds |
With a 250MB codebase, when the Page Cache is hit (which is almost always the case for a developer's daily projects), even disk I/O is eliminated; the entire data set sits in memory. Based on a modern dev machine's memory bandwidth of roughly 30GB/s, the lower bound for data transfer to shift this 250MB from the page cache is approximately 250MB / 30GB/s ≈ 8 milliseconds. In real-world conditions, the CPU overhead for ripgrep's SIMD pattern matching itself must also be added, but actual measured total time usually falls in the tens to low hundreds of milliseconds. Therefore, there is no necessity to build an index.
04
Claude Code opts for zero-indexing, but not everyone in the industry thinks this way. This chapter compares Claude Code with Cursor and Codex: what Cursor's dual-index architecture looks like, why Codex made a choice almost identical to Claude Code's but via a different implementation path, how scale determines architectural choices, and Claude Code's own direction of evolution. It then addresses the most common criticism of the Grep approach, namely the token cost issue, examining what mechanisms Claude Code uses in its source code to control costs.
4.1 Cursor's Dual-Index Architecture
Cursor uses a classic RAG architecture, layered atop a trigram index. This section only introduces Cursor's indexing part, which distinguishes it from Claude Code, because Cursor also has a Grep search tool available (see section 4.2).
Semantic Index: Locally, code is chunked by syntax boundaries using tree-sitter, incrementally synced via a Merkle Tree (only transmitting changed parts). Code chunks are encrypted and uploaded to Cursor's server, which uses an embedding model to generate vectors and then immediately discards the original code. Vectors and metadata are stored in Turbopuffer (a vector search engine). The search flow is: user query → embedding → vector k-nearest neighbor search → top-K → reranking → assembled into context.
Exact Search Index: Cursor developed Instant Grep in 2025-2026, using a trigram (3-character combination) inverted index to accelerate grep searches. During preprocessing, file contents are split into 3-character sliding windows (e.g., "OAuth" → "OAu", "Aut", "uth"), and a file list containing that trigram is maintained for each one. At search time, the intersection of the file lists for all trigrams in the search term is taken to get candidate files, and only these candidates undergo regex matching.
In summary, Cursor takes a preprocessing route: code repositories are chunked, embedded in the background, vectors written to Turbopuffer, and also fed into a Merkle tree for incremental sync maintenance. The offline-built index is the prerequisite for the entire chain.
Claude Code takes an on-demand route: no index, no preprocessing. The LLM decides in real time within a conversation what keywords to grep and which files to read; all semantic understanding is performed by the model itself within the loop. These two architectures represent two sets of tradeoffs: indexing buys hit rate and cross-repository scalability, while zero-indexing buys zero startup time, zero maintenance, and zero friction with the developer workflow.
4.2 Scale Determines Architecture
Cursor's indexing scale itself illustrates the point. Turbopuffer's official customer case study disclosed Cursor's vector infrastructure data: 10 billion+ vectors, 10 million+ namespaces (each corresponding to one user's codebase), write throughput of roughly 10GB/s. CTO Sualeh Asif called Turbopuffer "one of the few parts of our infrastructure we don't have to worry about scaling." This scale means Cursor deals with more than just small personal projects. When a codebase is large enough, the latency of brute-force Grep becomes unacceptable. For an Agent scenario, search latency directly determines how many search turns can be made within a finite time, thus determining the depth of the agent's understanding of the code.
So, zero-index and dual-index are not a matter of technical superiority, but a matter of scenario choice. Claude Code targets local developer projects (MB to hundreds of MB), where ripgrep's brute-force scan takes only tens of milliseconds. Coupled with the LLM's reasoning ability, zero-indexing means zero startup delay, zero maintenance cost, and zero configuration. Cursor targets a wider range of scenarios, including large codebases, where brute-force scan latency is unacceptable, making an index necessary.
But one point worth noting is that in Cursor's leaked Agent system prompt from March 2025, grep_search is explicitly marked as the MAIN exploration tool. The LLM is instructed to first Grep with a broad set of keywords, while codebase_search (semantic search) is only used as a supplement for "conceptual queries." A company that treats semantic search as a core selling point and built an entire vector infrastructure for it internally places Grep as the first tool to call. This suggests that for the task of code search, 'finding known symbols via exact match' is far more frequent and deterministic than 'finding similar concepts via semantic understanding.' Vector retrieval solves the long tail that Grep cannot cover, not the other way around. Industry trends also confirm this; analysis indicates Cursor is de-emphasizing pure vector search and shifting towards hybrid search, while Claude Code has pushed this route to the extreme—completely forgoing semantic retrieval, relying on the LLM to translate semantic needs into precise keywords for Grep.
It is worth noting that Claude Code itself is evolving. Version v2.0.74 introduced LSP (Language Server Protocol) support, using semantically precise operations like "go to definition" to replace some Grep + multi-file read operations, which practically reduced token consumption by about 40%. The community is also making supplements: someone developed the Beacon plugin, using Claude Code's built-in PreToolUse hooks to intercept Grep calls and replace them with hybrid search (vector + BM25 + rank fusion).
4.3 Validation from Codex: Zero-Index Arrived at via a Different Path
Earlier, we contrasted the different choices of Cursor (dual-index) and Claude Code (zero-index). But there is another important reference point: OpenAI's Codex CLI.
Codex's code search architecture is strikingly similar to Claude Code's: likewise no indexing, no embedding, no vector database. A community-submitted feature request for a vector index was closed by the OpenAI team, explicitly stating it's "not currently on our roadmap."
But there is a key divergence in their implementation paths: the design of the code search tools differs. Claude Code encapsulates special-purpose tools for search operations. GrepTool has three output modes and parameters like head_limit, GlobTool does filename matching, and FileReadTool reads line ranges. Each tool has a clear parameter schema, and the LLM uses them through structured tool calls. Codex has no dedicated search tool. Its core tools are shell (for executing arbitrary shell commands) and apply_patch (for editing files in a specific diff format), plus others like update_plan, view_image, web_search, and spawn_agent (for multi-agent collaboration). All code search operations are done via the shell tool. The LLM can directly combine Unix commands like rg, find, cat, and git to search.
Codex's multiple system prompt files contain the same instruction: "When searching for text or files, prefer using rg or rg --files respectively because rg is much faster than alternatives like grep." The search pattern is likewise multi-turn iteration: Grep → read file fragments → adjust keywords → search again.
Aspect | Claude Code | Codex CLI |
|---|---|---|
Search Tool | Dedicated tools (GrepTool, Glob, Read) with structured parameters | Uses |
Indexing | None | None |
Sub-agents | Built-in (Explore, Plan types with context isolation) | Built-in ( |
Editing Method | Edit (string replacement) | apply_patch (diff format) |
The core divergence of the two paths lies in the level of encapsulation for the search tool. Claude Code wraps Grep into a dedicated tool with structured parameters, so the LLM doesn't need to parse raw shell output, reducing error probability and making it easier for the system to control information volume. Codex lets the model directly write shell commands to call rg, granting maximum flexibility (free combination of pipes, regex, path filtering), but requiring the model to handle unstructured text output itself.
What is truly noteworthy is the consensus: two competing AI coding products independently arrived at almost the same architectural decision—using an LLM to drive ripgrep and abandoning vector retrieval. This is unlikely to be a coincidence. It indicates that within the current LLM capability level and the scale scope of local developer projects, zero-index + Grep is already a repeatedly validated and effective solution.
4.4 The Cost Problem of the Grep Approach and Countermeasures
Why choose Grep over vector retrieval? Two core reasons: no need to pre-build an index, as Grep directly searches real-time files on disk each time, resulting in zero startup delay, zero maintenance costs, and no stale index problem. Additionally, the core need of code search is precise matching, where Grep is more reliable than semantic similarity. But this brings an obvious cost problem: aren't multiple rounds of Grep and Read calls a massive token burner? After all, each turn of the search loop sends the complete conversation history to the Claude API, and the context keeps growing.
Engineers from the vector database company Milvus (Zilliz) once published an article titled "Why I'm Against Claude Code's Grep-Only Retrieval? It Just Burns Too Many Tokens" directly questioning this. The article showcased a real test case: debugging a bug in a VSCode extension using Claude Code. Grep repeatedly searched the repository, dumping large amounts of irrelevant text, ultimately taking 14 tool calls, 32.2k tokens, and 59.3 seconds to find the answer, when the correct 10 lines of code were actually buried in 500 lines of noise. The article summarized the problem into three points: token bloat (every Grep shovels massive irrelevant code into context, cost worsening with repository size), time tax (AI asks the codebase twenty questions while the developer waits), and zero semantics (Grep only does literal matching, understanding neither code meaning nor relationships). As an alternative, they open-sourced the MCP plugin Claude Context based on vector retrieval, claiming about a 40% reduction in token consumption and about a 36% reduction in tool call count on the same task.
So how does Claude Code itself deal with context bloat? From the source code, at least three layers of mechanisms exist (it should be noted that all three are general engineering techniques that embedding solutions can also use; they are not exclusive advantages of grep):
Layer 1: Prompt caching to reduce repeat billing. The API recognizes that the prefix of the current request's input is exactly the same as the previous one, because only the latest round of tool results is appended at the end. This allows reusing the existing computation cache, paying full price only for the newly added incremental part, while the previously accumulated bulk is billed at about 1/10th the cost (cache rate). The source code shows Claude Code has done meticulous engineering optimization for this: the system prompt is split into multiple independent text blocks before sending, each with individually marked caching policies. This block design ensures unchanged parts can precisely hit the cache, without being invalidated by dynamic content changes. An analysis by Vadim in December 2025 found that 92% of the prompt prefix is identical between adjacent turns in an agentic loop, cutting actual costs by about 81%.
Layer 2: Auto-compaction to compress history. Multiple rounds of grep/read cause the conversation history to grow continuously. When the accumulated token count approaches the context window limit, Claude Code automatically triggers conversation compaction: using an LLM to generate a summary of the old conversation history, then replacing the original messages with the summary to directly shorten the history. This means context doesn't grow infinitely; the grep results and read content from early search turns are eventually compressed into a summary, freeing up space for subsequent searches.
Layer 3: Sub-agents to isolate search results. The Explore sub-agent mentioned in Chapter 2 is itself a context management tool. The mass of raw grep/read results is processed and digested in the sub-agent's independent context, with only the refined conclusion returned to the main conversation, preventing it from being swamped by intermediate search results.
These three mechanisms make brute-force multi-turn search manageable in practice, but they don't eliminate the gap in single-retrieval precision between the Grep approach and the embedding approach. The core tradeoff of the Grep approach is: exchanging more search turns and larger context overhead for the engineering simplicity of zero-indexing, zero-maintenance, and zero startup delay. This tradeoff is cost-effective at the scale of a local developer's project, but whether it still holds at a larger scale depends on the growth curve of search turns and context costs.
4.5 Grep's Effectiveness Boundary: Code vs. Natural Language
Milvus's criticism targeted token overhead in general scenarios. But in the specific scenario of code search, Grep's performance may be much better than intuition suggests. A systematic study (GrepRAG: An Empirical Study and Optimization of Grep-Like Retrieval for Code Completion, ISSTA '26) conducted a rigorous comparison on two code benchmark datasets, CrossCodeEval and RepoEval_Updated: letting an LLM autonomously generate ripgrep commands to retrieve code context, then using the retrieved content for code completion. The results found that even the simplest single-turn Grep retrieval outperformed the embedding-based RAG baseline in code completion: On the RepoEval_Updated Python Line completion task, Naive GrepRAG's Exact Match reached 38.61%, whereas Vanilla RAG (BM25 + embedding) was only 24.99%. The paper analyzed the reason for grep's success in code scenarios: 95% of code search keywords are identifiers: class names (36%), method names (41%), variable names (18%). Identifiers are themselves the semantics of the code, and exact matching happens to be the most direct retrieval method. This is unlike natural language search, where "vocabulary mismatch" is the norm, but in code, getUserById is just getUserById; it won't be paraphrased into fetchPersonByIdentifier.
However, this all hinges on one premise: the task is code search. Once the scenario switches to something like natural language Q&A, the conclusion becomes much more complex. Zach Nussbaum conducted a real-world test in "On the Lost Nuance of Grep vs. Semantic Search": on the Natural Questions dataset (a natural language Q&A dataset), he directly used ripgrep (rg -i -c, treating the query after removing stop words as a set of keywords to search for). The initial results were poor because of the often severe vocabulary mismatch between the Q&A query and the answer documents; the query uses one concept, the document uses synonyms or paraphrases. Interestingly, just adding one step—LLM query expansion, using a cheap small model (gpt-5-mini) to first rewrite the query into a set of more relevant keywords, and then feeding these keywords to Grep—boosted recall about 5-10 times. In other words, the semantic understanding from the LLM on the query side could produce an effect similar to embedding-based semantic search. But even with this step, Grep still couldn't catch up to embedding's semantic matching capability: when users only remember a side characteristic of a concept but can't think of appropriate keywords, vector retrieval remains the more suitable tool.
Function names and class names in code are precision anchors deliberately planted by programmers, which Grep can almost always hit; a single concept in natural language can have a dozen different expressions, requiring an LLM or embedding to help translate the concept into possible keywords. Claude Code's choice to abandon embedding is not because vector retrieval itself is ineffective, but because code search happens to be the niche where Grep excels.
It should be noted that both the GrepRAG paper and the above test on the NQ dataset used single-turn retrieval, without an iterative process of "see unsatisfying results → change keywords and search again." Meanwhile, Claude Code's search loop is multi-turn iterative (like the four-turn search in the Chapter 2 practical example). Multi-turn iteration can theoretically further improve results (adjusting search direction based on intermediate findings), but it also means more context overhead, which is precisely the pain point of Milvus's criticism. How much improvement multi-turn grep can bring over single-turn, and what context cost it incurs, currently lacks direct experimental data.
05
Let's return to the opening question: Is RAG really dead? To answer this, we must first ask something more fundamental: What exactly do we mean by RAG? By its original definition of Retrieval-Augmented Generation, it refers to a very broad paradigm: first retrieve relevant content, then stuff the retrieved results into the context, and finally let the model generate an answer based on that content. By this definition, what Claude Code does completely fits RAG. It just swapped the retrieval layer from embedding + vector database to LLM-driven grep and glob; the entire "retrieval → context → generation" skeleton remains unchanged.
But the 'RAG' that has been repeatedly declared dead over the past year actually refers to something much narrower: pre-chunking code, embedding it, writing it into a vector database, and running a k-nearest-neighbor search on a user's query, feeding the top-K results to the model. This is the most common implementation form of RAG, but it is only one implementation of the paradigm. So the more accurate statement is: it's not that RAG is dead, but that the specific way of doing RAG—"pre-building an index, static one-shot retrieval"—is being replaced in certain scenarios.
This article has explained above why this substitution can happen in the specific scenario of code search, summarized in three reasons:
Code is inherently Grep-friendly. Function names, class names, and constants in code are essentially high-precision anchors planted by programmers, and exact matching is the most direct retrieval method by nature. The GrepRAG paper verified this on benchmarks like CrossCodeEval: single-turn grep-driven retrieval could surpass the embedding RAG baseline. This also explains why even Cursor, a company that treats semantic indexing as a core selling point, internally labels
grep_searchas the 'main exploration tool' in their system prompt.The scale of a developer's local project can withstand brute-force scanning. ripgrep finishes scanning a 4,500-file project in just 0.1 seconds; at this order of magnitude, an offline index is completely unnecessary. The premise "brute-force search is slow" applies when data is too large for the brute-force algorithm to run, and most local codebases are still several orders of magnitude away from that threshold.
Agents bring about a shift in retrieval patterns. Traditional RAG is passive: the system pre-decides "what you might need to see" before the question arises, doing a one-time retrieval of a batch of relevant chunks to stuff into the context, and the model can only reason over this given set of contents. In contrast, retrieval in the Agent era is proactive: the model actively decides at each turn what it currently needs, which tool to get it with, and whether to keep searching after obtaining results. The four-turn practical search in Chapter 2 is the concrete form of proactive search; what to search for at each step is determined by the discovery of the previous step, a path that no pre-retrieval could ever guess. Under this scenario, the potential of Grep can be fully realized; for instance, in the experiment of Section 4.5, after using an LLM to rewrite the query, the accuracy of just a single-turn search improved by 5-10 times.
This is what is truly happening behind the batch of headlines claiming 'RAG is dead': what is dying is not the retrieval-augmented generation paradigm itself, but the default assumption that code search must rely on embedding pre-indexing. Claude Code and Codex both arrived at a zero-index choice via different paths, indicating that in the realm of code search, using an LLM to drive Grep is already a good enough, or even less cumbersome, alternative. What about beyond this scope? In scenarios dominated by soft semantics like natural language Q&A, embedding remains a significant part, and at larger scales of code repositories, indexes cannot be discarded. In conclusion, the choice of technology is dictated by the characteristics and scale of the data, not a matter of faith.
References
Official Public Information from Claude Code:
Boris Cherny (Creator of Claude Code), X/Twitter post: "Early versions of Claude Code used RAG + a local vector db, but we found pretty quickly that agentic search generally works better." https://x.com/bcherny/status/2017824286489383315
Boris Cherny, Latent Space podcast: Claude Code: Anthropic's Agent in Your Terminal: "This was just vibes, so internal vibes. There's some internal benchmarks also, but mostly vibes." https://www.latent.space/p/claude-code
Boris Cherny, Pragmatic Engineer interview: Building Claude Code: "Plain glob and grep, driven by the model, beat everything." Also mentioned observing engineers at Meta retreating to manual grep after IDE crashes. https://newsletter.pragmaticengineer.com/p/building-claude-code-with-boris-cherny
Cat Wu (Anthropic engineer), Every podcast interview https://every.to/podcast/transcript-how-to-use-claude-code-like-the-people-who-built-it
Anthropic Official Blog, Effective Context Engineering for AI Agents https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Boris Cherny, Hacker News comment: "Claude Code doesn't use RAG currently. In our testing we found that agentic search out-performed RAG for the kinds of things people use Code for." https://news.ycombinator.com/item?id=43164253
Academic Research:
GrepRAG: An Empirical Study and Optimization of Grep-Like Retrieval for Code Completion (ISSTA '26): Systematically compares grep retrieval vs. embedding/graph RAG on CrossCodeEval and RepoEval_Updated, proving that LLM-driven single-turn grep outperforms traditional RAG baselines on code completion tasks. https://arxiv.org/abs/2601.23254
Community Analysis and Discussion:
"Claude Code Doesn't Index Your Codebase. Here's What It Does Instead.": LMCache 92% prompt reuse rate analysis, Explore sub-agent's Haiku model choice, semantic gap case of function renaming. https://vadim.blog/claude-code-no-indexing
Zhihu: Claude Code LSP reduces token consumption: v2.0.74 LSP support with actual testing data https://zhuanlan.zhihu.com/p/1993974927498433157
Milvus Blog: Against grep-only retrieval: critique from a vector database vendor's perspective https://milvus.io/zh/blog/why-im-against-claude-codes-grep-only-retrieval-it-just-burns-too-many-tokens.md
Beacon Plugin: Community practice of intercepting grep with hooks to replace it with hybrid search https://dev.to/sagarmk/how-i-built-a-claude-code-plugin-that-intercepts-grep-and-replaces-it-with-semantic-search-500h
Claude Code source code analysis: Based on a leaked snapshot of the Claude Code CLI source code from March 31, 2026.
Cursor-related: Engineer's Codex: How Cursor Indexes Codebases Fast, Cursor Agent system prompt (leaked version from March 2025), Turbopuffer Customer Story: Cursor. https://read.engineerscodex.com/p/how-cursor-indexes-codebases-fast https://turbopuffer.com/customers/cursor
OpenAI Codex-related:
Codex Prompting Guide: system prompt advises "prefer using rg..." https://developers.openai.com/cookbook/examples/gpt-5/codex_prompting_guide
Unrolling the Codex Agent Loop: detailed five-stage agent loop explanation https://openai.com/index/unrolling-the-codex-agent-loop/
GitHub Issue #609: request for vector indexing feature was closed by the OpenAI team, "not currently on our roadmap" https://github.com/openai/codex/issues/609