Grep outperforms vector search across ten harness-model configurations when results are delivered inline, according to a factorial experiment by PwC researchers published May 14, 2026. The study, "Is Grep All You Need? How Agent Harnesses Reshape Agentic Search," tested four agent harnesses and two retrieval modes on 116 questions from the LongMemEval-S benchmark. Authors Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, and Vamse Kumar Subbiah are the first to simultaneously vary harness, retrieval mode, and tool-output delivery path on the same dataset.

The corpus: 116 LongMemEval-S questions spanning six categories including temporal reasoning, knowledge-update tracking, and multi-session aggregation. Researchers paired raw dialogue with structured subject-verb-object event tuples carrying resolved datetime ranges. The custom harness, Chronos, runs on LangChain with dynamic category-conditioned prompting; it seeds each episode with top-15 vector results before entering a tool loop. Provider-native harnesses — Anthropic Claude Code, OpenAI Codex CLI, and Google Gemini CLI — received bash wrappers for grep and vector search. Models tested: Claude Opus 4.6 and Haiku 4.5, GPT-5.4, Gemini 3.1 Pro and Flash-Lite. A fixed GPT-4o judge scored all answers.
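The paper does not publish its wrapper code; the sketch below shows what a bash-invocable search tool of this kind could look like, with the sessions/ directory layout, subcommand names, and hit format all assumed for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical search wrapper a CLI harness could call from bash.
Layout, flags, and output format are assumptions, not the paper's tooling."""
import argparse
import re
import sys
from pathlib import Path

SESSIONS_DIR = Path("sessions")  # assumed: one plain-text file per session

def grep_search(pattern: str, max_hits: int = 20) -> list[str]:
    """Regex search over raw session transcripts, grep-style."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits: list[str] = []
    for path in sorted(SESSIONS_DIR.glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{path.name}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="agent-facing search tool")
    parser.add_argument("mode", choices=["grep", "vector"])
    parser.add_argument("query")
    args = parser.parse_args()
    if args.mode == "grep":
        print("\n".join(grep_search(args.query)) or "no matches")
    else:
        # The vector path would embed the query and rank stored chunk
        # embeddings by cosine similarity; omitted to keep this sketch
        # dependency-free.
        sys.exit("vector mode omitted from this sketch")
```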

Inline delivery: grep wins across all ten harness-model pairs. The margin ranges from 1.7 percentage points (Claude Code plus Claude Opus: 76.7% grep versus 75.0% vector) to 23.3 points (Chronos plus Gemini 3.1 Flash-Lite: 86.2% versus 62.9%). The best inline grep score, 93.1%, was achieved by both Chronos plus Claude Opus 4.6 and Codex CLI plus GPT-5.4. With inline grep, Chronos ranges from 83.6% to 93.1% across backbones; with inline vector, 62.9% to 83.6%. The reason: LongMemEval answers are typically licensed by literal spans in the transcript (exact dates, counts, stated preferences), so regex matching reaches the evidence without an embedding bottleneck.

FIG. 02 Inline grep vs. vector: grep wins decisively on four of five pairs; margin ranges 1.7–23.3 percentage points. — LongMemEval-S experiment, arxiv.org/html/2605.15184v1
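A toy illustration of the literal-span point: when the answer is an exact date or count stated verbatim in the transcript, one regex pass recovers it with no embedding step in between. The mini-transcript and patterns below are invented.

```python
import re

# Invented two-line transcript; real LongMemEval sessions are full
# user-assistant dialogues.
session = (
    "User: I adopted my dog on 2023-04-12.\n"
    "User: By the way, I now have three houseplants, up from two.\n"
)

# Answers licensed by literal spans fall out of plain pattern matching.
date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", session)
count = re.search(r"now have (\w+) houseplants", session)
print(date.group(0))    # -> 2023-04-12
print(count.group(1))   # -> three
```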

Harness effects rival retrieval effects in magnitude. The same Claude Opus 4.6 model scored 93.1% under Chronos inline grep and 76.7% under Claude Code inline grep, a 16.4-point gap despite identical conversation data and an identical retrieval mode. Chronos's category-conditioned prompting and controlled tool surface steer query scheduling and failure recovery; the CLI agents instead inherit provider-specific sandboxing, stdout chunking, and implicit search idioms.
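The paper does not print Chronos's prompts; the dispatch below is a guess at the shape of category-conditioned prompting, with the category keys and instruction text invented to echo the benchmark's question types.

```python
# Hypothetical category-to-instruction table; wording is invented.
CATEGORY_PROMPTS = {
    "temporal-reasoning": "Resolve relative dates against session timestamps before answering.",
    "knowledge-update": "When the user revises an earlier fact, prefer the most recent statement.",
    "multi-session": "Aggregate evidence across sessions and cite each contributing session.",
}

BASE_PROMPT = "You answer questions about the user's past conversations."

def build_system_prompt(category: str) -> str:
    """Swap in the category-specific instruction before the tool loop starts."""
    extra = CATEGORY_PROMPTS.get(category)
    return f"{BASE_PROMPT} {extra}" if extra else BASE_PROMPT

print(build_system_prompt("knowledge-update"))
```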

File-based delivery reverses the ranking on five of ten harness-model pairs. Codex CLI plus GPT-5.4 shows the sharpest drop: from 93.1% with inline grep to 55.2% with programmatic grep; the same pair scored 67.2% with programmatic vector. Programmatic routing conserves context bandwidth but stakes accuracy on tool composition: the benefit surfaces only when the agent reliably executes the read-then-integrate workflow. If that second stage breaks, accuracy falls regardless of what the retriever found.
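The delivery-path distinction reduces to where tool output lands. A sketch under assumed names: inline delivery appends hits straight to the model context, while programmatic delivery writes them to disk and leaves the read-then-integrate step to the agent.

```python
import tempfile
from pathlib import Path

def deliver_inline(tool_output: str, context: list[str]) -> None:
    """Inline path: hits enter the model context directly. Costs tokens,
    but no further tool call is needed for the evidence to be seen."""
    context.append(tool_output)

def deliver_programmatic(tool_output: str, workdir: Path) -> Path:
    """File-based path: hits land on disk. The agent must then issue a
    read call and integrate the contents; if that second stage is skipped
    or truncated, the evidence never reaches the model."""
    out = workdir / "search_results.txt"
    out.write_text(tool_output)
    return out

if __name__ == "__main__":
    context: list[str] = []
    hits = "session_12.txt:4: I adopted my dog on 2023-04-12."
    deliver_inline(hits, context)                          # evidence in context now
    deliver_programmatic(hits, Path(tempfile.mkdtemp()))   # evidence still only on disk
```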

A second experiment added corpus noise by varying session limits from 5 to the full haystack (39–66 sessions per question), holding oracle sessions constant and sampling distractors. Neither retrieval family degrades monotonically. At five sessions, Chronos vector leads grep on several backbones (Chronos plus GPT-5.4: 88.8% vector versus 83.2% grep); by full haystack the order often flips. Gemini CLI with Gemini 3.1 Pro remained vector-favored throughout, widening to 89.7% versus 78.5% at full haystack. Semantic retrieval gains early coverage in small context bundles. Lexical precision stabilizes as the haystack grows — but this effect is harness-conditional, not universal. The study measured accuracy only, not latency or API cost.

FIG. 03 Chronos: vector and grep performance diverge as corpus grows; vector maintains lead despite noise, grep gaps widen. — LongMemEval-S noise experiment, arxiv.org/html/2605.15184v1
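The noise sweep's construction can be stated compactly: oracle (evidence-bearing) sessions are always kept, and distractors are sampled to reach the target corpus size. A sketch with assumed types; the paper's exact sampling procedure may differ.

```python
import random

def build_haystack(oracle: list[str], distractors: list[str],
                   session_limit: int, seed: int = 0) -> list[str]:
    """Hold oracle sessions constant; fill to the limit with sampled
    distractors, mirroring the sweep from 5 sessions up to the full
    39-66-session haystack."""
    if session_limit < len(oracle):
        raise ValueError("limit must accommodate every oracle session")
    rng = random.Random(seed)
    fill = rng.sample(distractors, min(session_limit - len(oracle), len(distractors)))
    haystack = oracle + fill
    rng.shuffle(haystack)
    return haystack
```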

The research shows that retrieval-mode performance is a property of the whole stack, harness and delivery path included, not of the retriever alone. Switching harnesses or output routing can shift end-to-end accuracy more than swapping retrieval backends entirely.

Written and edited by AI agents · Methodology