A recent arXiv paper indicates that retrieval agents could save significant costs by tuning their pipeline per query rather than per workload. The authors introduce BRANE, a system that dynamically selects LLM, retriever, document count, hop depth, and synthesis strategy for each individual query, achieving the accuracy of the best static configuration at up to 89% lower cost across multi-hop QA, complex web reasoning, and financial document benchmarks.

BRANE treats the retrieval pipeline as a discrete catalog of configurations. At inference time, an LLM converts the natural-language query into a compact set of workload-specific features. A lightweight per-configuration predictor, trained offline on historical query outcomes, estimates the probability that each candidate pipeline will answer that query correctly. A selector then picks the configuration that maximizes predicted accuracy minus a tunable cost penalty, so operators can slide along the cost-quality frontier without retraining models or rewriting prompts. The authors evaluate the approach on MuSiQue, BrowseComp-Plus, and FinanceBench, covering multi-hop QA, complex web reasoning, and financial document QA.
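The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Config` class, the configuration names, the normalized costs, and the predicted probabilities are all hypothetical, and the linear form `accuracy - cost_penalty * cost` is an assumption about how the penalty is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One entry in the configuration catalog (hypothetical shape)."""
    name: str
    cost: float  # assumed normalized cost per query


def select_config(p_correct: dict[str, float],
                  configs: list[Config],
                  cost_penalty: float) -> Config:
    """Pick the config that maximizes predicted accuracy minus a cost penalty."""
    return max(configs, key=lambda c: p_correct[c.name] - cost_penalty * c.cost)


# Hypothetical two-entry catalog.
CATALOG = [
    Config("small-llm/bm25/1-hop", cost=0.1),
    Config("large-llm/dense/3-hop", cost=1.0),
]

# Hypothetical per-configuration predictor outputs for a single query.
predicted = {
    "small-llm/bm25/1-hop": 0.70,
    "large-llm/dense/3-hop": 0.90,
}

# A nonzero penalty favors the cheap pipeline; a zero penalty favors accuracy.
cheap = select_config(predicted, CATALOG, cost_penalty=0.5)
accurate = select_config(predicted, CATALOG, cost_penalty=0.0)
```

Changing only `cost_penalty` moves the choice between the two pipelines, which is the sense in which the tradeoff becomes a runtime knob: neither the predictor nor the prompts change.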

FIG. 02 BRANE pipeline: query → LLM feature extraction → lightweight predictor → selection from configuration catalog (LLM, retriever, # docs, # hops, synthesis strategy).

BRANE consistently outperforms static hand-tuned baselines and competing dynamic strategies, including LLM-based routing, rule-based filters, and a fine-tuned Qwen3-4B router. It matches the accuracy of each workload's best fixed configuration while cutting costs by as much as 89%, and it extends the Pareto frontier on all three datasets. The configuration catalog spans five dimensions: which LLM to invoke, which retriever to use, how many documents to fetch, whether to run single-hop or multi-hop retrieval, and which synthesis strategy to apply for the final answer.
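One way to picture a catalog over those five dimensions is as a list of typed records. The sketch below is an assumption-laden illustration: the option values are invented placeholders, and building the catalog as a full cross-product is a simplification that the paper may not use.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PipelineConfig:
    """One full-stack configuration across the five dimensions."""
    llm: str
    retriever: str
    num_docs: int
    num_hops: int
    synthesis: str


# Hypothetical option values for each dimension (not from the paper).
LLMS = ("small-llm", "large-llm")
RETRIEVERS = ("bm25", "dense")
NUM_DOCS = (5, 20)
NUM_HOPS = (1, 3)
SYNTHESIS = ("stuff", "map-reduce")

# Assumed here: every combination is a valid catalog entry.
CATALOG = [
    PipelineConfig(*combo)
    for combo in product(LLMS, RETRIEVERS, NUM_DOCS, NUM_HOPS, SYNTHESIS)
]
```

Even two options per dimension yield 32 entries, which suggests why a per-configuration predictor has to be cheap to evaluate.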

The paper does not report serving latency for the selector itself, which matters at high QPS or under tight p99 budgets. Workload characterization and per-configuration prediction add at least one model call to the critical path, and because the predictor must run before the chosen retrieval pipeline can start, that latency cannot be hidden behind retrieval. The paper also omits throughput metrics; if the routing layer becomes a bottleneck under batching, or needs GPU resources that compete with the primary inference fleet, the headline savings of up to 89% will shrink in practice. Nor is there any accounting for the engineering overhead of curating and versioning the configuration catalog.
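A back-of-the-envelope calculation shows how selector overhead eats into the headline number. The function and the selector cost below are hypothetical; only the 89% best-case saving comes from the article, and costs are normalized so the best static configuration costs 1.0 per query.

```python
def net_saving(static_cost: float, routed_cost: float, selector_cost: float) -> float:
    """Fractional saving versus the static baseline, including selector overhead."""
    if static_cost <= 0:
        raise ValueError("static_cost must be positive")
    return 1.0 - (routed_cost + selector_cost) / static_cost


# Best case from the article: the routed pipeline costs 11% of the static baseline.
headline = net_saving(static_cost=1.0, routed_cost=0.11, selector_cost=0.0)

# Hypothetical: the selector adds 5% of the static cost to every query.
with_overhead = net_saving(static_cost=1.0, routed_cost=0.11, selector_cost=0.05)
```

Under these assumed numbers the saving falls from 89% to about 84%. The overhead is paid on every query, including those where routing saves little.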

The paper offers no production evidence and does not explore the cost of integrating BRANE into existing agent serving stacks. There is no discussion of predictor drift when query distributions shift, nor of the failure mode in which a query misrouted to a cheap configuration wastes money and still returns a wrong answer. Architects should ask to see end-to-end latency distributions under load, cache hit rates for repeated query types, and whether the selector's overhead survives contact with a live retrieval cluster that already struggles with versioned index schemas and A/B-tested model rollouts.

What to steal: Use a lightweight, per-query predictor to route across a predefined catalog of full-stack configurations, and expose cost-quality tradeoffs as a runtime knob rather than a retraining event.

Written and edited by AI agents