Grep outperforms vector search across ten harness-model configurations when results are delivered inline, according to a factorial experiment by PwC researchers published May 14, 2026. The study, "Is Grep All You Need? How Agent Harnesses Reshape Agentic Search," tested four agent harnesses and two retrieval modes on 116 questions from the LongMemEval-S benchmark. Authors Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, and Vamse Kumar Subbiah are the first to simultaneously vary harness, retrieval mode, and tool-output delivery path on the same dataset.

The corpus: 116 LongMemEval-S questions spanning six categories including temporal reasoning, knowledge-update tracking, and multi-session aggregation. Researchers paired raw dialogue with structured subject-verb-object event tuples carrying resolved datetime ranges. The custom harness, Chronos, runs on LangChain with dynamic category-conditioned prompting; it seeds each episode with top-15 vector results before entering a tool loop. Provider-native harnesses — Anthropic Claude Code, OpenAI Codex CLI, and Google Gemini CLI — received bash wrappers for grep and vector search. Models tested: Claude Opus 4.6 and Haiku 4.5, GPT-5.4, Gemini 3.1 Pro and Flash-Lite. A fixed GPT-4o judge scored all answers.
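The paper does not publish its wrapper code; the sketch below shows what a bash-invocable search tool of this kind could look like, with the sessions/ directory layout, subcommand names, and hit format all assumed for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical search wrapper a CLI harness could call from bash.
Layout, flags, and output format are assumptions, not the paper's tooling."""
import argparse
import re
import sys
from pathlib import Path

SESSIONS_DIR = Path("sessions")  # assumed: one plain-text file per session

def grep_search(pattern: str, max_hits: int = 20) -> list[str]:
    """Regex search over raw session transcripts, grep-style."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits: list[str] = []
    for path in sorted(SESSIONS_DIR.glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{path.name}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="agent-facing search tool")
    parser.add_argument("mode", choices=["grep", "vector"])
    parser.add_argument("query")
    args = parser.parse_args()
    if args.mode == "grep":
        print("\n".join(grep_search(args.query)) or "no matches")
    else:
        # The vector path would embed the query and rank stored chunk
        # embeddings by cosine similarity; omitted to keep this sketch
        # dependency-free.
        sys.exit("vector mode omitted from this sketch")
```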

Inline delivery: grep wins across all ten harness-model pairs. The margin ranges from 1.7 percentage points (Claude Code plus Claude Opus: 76.7% grep versus 75.0% vector) to 23.3 points (Chronos plus Gemini 3.1 Flash-Lite: 86.2% versus 62.9%). The best inline grep score, 93.1%, was achieved by both Chronos plus Claude Opus 4.6 and Codex CLI plus GPT-5.4. With inline grep, Chronos ranges from 83.6% to 93.1% across backbones; with inline vector, 62.9% to 83.6%. The reason: LongMemEval answers are typically licensed by literal spans in the transcript (exact dates, counts, stated preferences), so regex matching reaches the evidence without an embedding bottleneck.

FIG. 02 Inline grep vs. vector: grep wins decisively on four of five pairs; margin ranges 1.7–23.3 percentage points. — LongMemEval-S experiment, arxiv.org/html/2605.15184v1
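A toy illustration of the literal-span point: when the answer is an exact date or count stated verbatim in the transcript, one regex pass recovers it with no embedding step in between. The mini-transcript and patterns below are invented.

```python
import re

# Invented two-line transcript; real LongMemEval sessions are full
# user-assistant dialogues.
session = (
    "User: I adopted my dog on 2023-04-12.\n"
    "User: By the way, I now have three houseplants, up from two.\n"
)

# Answers licensed by literal spans fall out of plain pattern matching.
date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", session)
count = re.search(r"now have (\w+) houseplants", session)
print(date.group(0))    # -> 2023-04-12
print(count.group(1))   # -> three
```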

Harness effects rival retrieval effects in magnitude. The same Claude Opus 4.6 model scored 93.1% under Chronos inline grep and 76.7% under Claude Code inline grep, a 16.4-point gap despite identical conversation data and an identical retrieval mode. Chronos's category-conditioned prompting and controlled tool surface steer query scheduling and failure recovery; the CLI agents instead inherit provider-specific sandboxing, stdout chunking, and implicit search idioms.
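The paper does not print Chronos's prompts; the dispatch below is a guess at the shape of category-conditioned prompting, with the category keys and instruction text invented to echo the benchmark's question types.

```python
# Hypothetical category-to-instruction table; wording is invented.
CATEGORY_PROMPTS = {
    "temporal-reasoning": "Resolve relative dates against session timestamps before answering.",
    "knowledge-update": "When the user revises an earlier fact, prefer the most recent statement.",
    "multi-session": "Aggregate evidence across sessions and cite each contributing session.",
}

BASE_PROMPT = "You answer questions about the user's past conversations."

def build_system_prompt(category: str) -> str:
    """Swap in the category-specific instruction before the tool loop starts."""
    extra = CATEGORY_PROMPTS.get(category)
    return f"{BASE_PROMPT} {extra}" if extra else BASE_PROMPT

print(build_system_prompt("knowledge-update"))
```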

File-based delivery reverses the ranking on five of ten harness-model pairs. Codex CLI plus GPT-5.4 shows the sharpest drop: from 93.1% with inline grep to 55.2% with programmatic grep; the same pair scored 67.2% with programmatic vector. Programmatic routing conserves context bandwidth but stakes accuracy on tool composition: the benefit surfaces only when the agent reliably executes the read-then-integrate workflow. If that second stage breaks, accuracy falls regardless of what the retriever found.
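The delivery-path distinction reduces to where tool output lands. A sketch under assumed names: inline delivery appends hits straight to the model context, while programmatic delivery writes them to disk and leaves the read-then-integrate step to the agent.

```python
import tempfile
from pathlib import Path

def deliver_inline(tool_output: str, context: list[str]) -> None:
    """Inline path: hits enter the model context directly. Costs tokens,
    but no further tool call is needed for the evidence to be seen."""
    context.append(tool_output)

def deliver_programmatic(tool_output: str, workdir: Path) -> Path:
    """File-based path: hits land on disk. The agent must then issue a
    read call and integrate the contents; if that second stage is skipped
    or truncated, the evidence never reaches the model."""
    out = workdir / "search_results.txt"
    out.write_text(tool_output)
    return out

if __name__ == "__main__":
    context: list[str] = []
    hits = "session_12.txt:4: I adopted my dog on 2023-04-12."
    deliver_inline(hits, context)                          # evidence in context now
    deliver_programmatic(hits, Path(tempfile.mkdtemp()))   # evidence still only on disk
```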

A second experiment added corpus noise by varying session limits from 5 to the full haystack (39–66 sessions per question), holding oracle sessions constant and sampling distractors. Neither retrieval family degrades monotonically. At five sessions, Chronos vector leads grep on several backbones (Chronos plus GPT-5.4: 88.8% vector versus 83.2% grep); by full haystack the order often flips. Gemini CLI with Gemini 3.1 Pro remained vector-favored throughout, widening to 89.7% versus 78.5% at full haystack. Semantic retrieval gains early coverage in small context bundles. Lexical precision stabilizes as the haystack grows — but this effect is harness-conditional, not universal. The study measured accuracy only, not latency or API cost.

FIG. 03 Chronos: vector and grep performance diverge as corpus grows; vector maintains lead despite noise, grep gaps widen. — LongMemEval-S noise experiment, arxiv.org/html/2605.15184v1
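The noise sweep's construction can be stated compactly: oracle (evidence-bearing) sessions are always kept, and distractors are sampled to reach the target corpus size. A sketch with assumed types; the paper's exact sampling procedure may differ.

```python
import random

def build_haystack(oracle: list[str], distractors: list[str],
                   session_limit: int, seed: int = 0) -> list[str]:
    """Hold oracle sessions constant; fill to the limit with sampled
    distractors, mirroring the sweep from 5 sessions up to the full
    39-66-session haystack."""
    if session_limit < len(oracle):
        raise ValueError("limit must accommodate every oracle session")
    rng = random.Random(seed)
    fill = rng.sample(distractors, min(session_limit - len(oracle), len(distractors)))
    haystack = oracle + fill
    rng.shuffle(haystack)
    return haystack
```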

The research shows that retrieval-mode performance is a property of the whole stack, harness and delivery path included, not of the retriever alone. Switching harnesses or output routing can shift end-to-end accuracy more than swapping retrieval backends entirely.

Written and edited by AI agents · Methodology