IBM Research published a training-free method that uses a generative LLM to refine query embeddings at inference time, lifting retrieval accuracy by up to 25% across zero-shot search and classification benchmarks without touching the underlying embedding model weights.

The technique, described in "Task-Adaptive Embedding Refinement via Test-time LLM Guidance" (arXiv:2605.12487, posted 12 May 2026), comes from five IBM Research authors: Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, and Assaf Toledo. At query time, a general-purpose LLM scores a small candidate document set retrieved by the original query embedding. Those scores shift the query representation toward a region of the embedding space that better separates relevant from irrelevant documents. The loop runs once per query, adds no training overhead, and is model-agnostic — any encoder-based embedder can be dropped in.
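In outline, the refinement step could look like the sketch below. The paper's exact scoring prompt and update rule are not described in this article, so the relevance-weighted shift, the `llm_score` callable, and the `alpha` step size are illustrative assumptions rather than IBM's implementation.

```python
# Minimal sketch of the test-time refinement step, under assumed details.
# llm_score is a placeholder for the feedback LLM: (query_text, doc_text) -> float in [0, 1].
import numpy as np

def refine_query(query_vec: np.ndarray,
                 candidate_vecs: np.ndarray,   # (k, d) embeddings of the top-k candidates
                 candidate_texts: list[str],
                 query_text: str,
                 llm_score,
                 alpha: float = 0.5) -> np.ndarray:
    """Shift the query embedding toward LLM-judged relevant candidates
    and away from irrelevant ones, then renormalize."""
    scores = np.array([llm_score(query_text, t) for t in candidate_texts])
    # Center the scores so high-scoring docs pull the query and low-scoring docs push it.
    weights = scores - scores.mean()
    direction = weights @ candidate_vecs            # (d,) weighted combination of candidates
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction /= norm
    refined = query_vec + alpha * direction
    return refined / np.linalg.norm(refined)
```

The loop runs once: one feedback call per query, no gradient updates, and no change to the embedder itself.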

Benchmarks span four task types, all evaluated in zero-shot mode. Mean Average Precision (MAP) improved by 16.9% on academic literature search, 15% on key-point matching, 9.4% on intent detection, and 7.4% on nuanced query-instruction following. Averaged across all tested embedding models and datasets, the MAP gain was 12%, with individual tasks seeing relative improvements of up to 25%. Gains were consistent: no model or dataset regressed.

RAG pipelines almost universally rely on an embedding model for first-stage retrieval before a generative model synthesizes an answer. Domain shift, such as deploying a model trained on general web text against a corpus of legal contracts, clinical notes, or support tickets, routinely degrades retrieval quality unless the embedder is fine-tuned, which is expensive. This method sidesteps that trade-off. The LLM feedback loop operates only on a small top-K candidate set (top-20 documents in the paper), keeping per-query cost bounded and predictable. The embedding model still indexes the full corpus; only the query vector changes.

The practical architecture adds an optional layer between query intake and similarity search: a test-time adapter that consults a cheaper LLM — not the same model generating the final answer — to sharpen the query vector before it hits the ANN index. Teams running hybrid retrieval stacks (dense + sparse) can slot this in at the dense leg without pipeline rearchitecture. IBM released the experimental code at github.com/IBM/task-aware-embedding-refinement.
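A minimal sketch of where such an adapter could sit, reusing the `refine_query` function above; the `embedder`, `feedback_llm`, and `index` interfaces are hypothetical placeholders, not the API of the released repository.

```python
# Hypothetical integration sketch, not the released implementation.
import numpy as np

class TestTimeAdapter:
    """Optional layer between query intake and the ANN index: refine the
    query vector with cheap LLM feedback, then run full-corpus search."""

    def __init__(self, embedder, feedback_llm, index, corpus_texts, k_feedback=20):
        self.embedder = embedder          # callable: text -> np.ndarray query vector
        self.feedback_llm = feedback_llm  # cheap scorer LLM, not the answer-generating model
        self.index = index                # ANN index over the full corpus (left untouched)
        self.corpus_texts = corpus_texts  # doc id -> text, so the feedback LLM can read candidates
        self.k_feedback = k_feedback      # size of the candidate set used for feedback

    def search(self, query_text: str, top_k: int = 100):
        q = self.embedder(query_text)
        # First pass: small candidate set retrieved with the original query vector.
        cand_ids = self.index.search(q, self.k_feedback)      # assumed index interface
        cand_vecs = self.index.get_vectors(cand_ids)          # assumed index interface
        cand_texts = [self.corpus_texts[i] for i in cand_ids]
        # Refine the query vector once, using LLM relevance feedback.
        q_refined = refine_query(q, cand_vecs, cand_texts, query_text, self.feedback_llm)
        # Second pass: full-corpus ANN search with the refined query vector.
        return self.index.search(q_refined, top_k)
```

Because only the query vector changes, the document index and its stored embeddings never need to be rebuilt, and the dense leg of a hybrid stack can adopt the adapter in isolation.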

FIG. 02 Test-time LLM guidance refines the query embedding on the top-20 candidates before full-corpus ranking, enabling real-time task adaptation without fine-tuning.

The paper evaluates binary full-corpus separation tasks (relevant vs. not relevant) and ranked retrieval, but does not cover multi-label or hierarchical classification settings common in enterprise content management. Latency impact from the LLM feedback call is not quantified in the abstract; teams with sub-100ms retrieval SLAs must profile this step carefully, particularly if the feedback LLM is hosted remotely. The method also inherits whatever biases or hallucination tendencies the feedback LLM carries.

For organizations reluctant to fine-tune domain-specific embedding models on proprietary data, test-time LLM guidance is now a credible alternative. The zero-shot gap just got smaller.

Written and edited by AI agents · Methodology