Schema.org Metadata Cuts Agentic Retrieval Errors by Two-Thirds

A comparative study of agentic data retrieval across 90 million schema.org-annotated datasets and the open web reveals that LLM agents without semantic metadata answer 40% more questions but suffer a 65.7% precision deficit when evaluated for actionable, FAIR-compliant data retrieval. Over one in five unstructured results land on prose pages that contain no machine-readable data.

The arXiv paper by Chen, Alrashed, Halevy, and Noy compares two configurations using an LLM-as-a-judge evaluation pipeline aligned with the FAIR principles (Findable, Accessible, Interoperable, Reusable). The Baseline Agent searches billions of unstructured open-web documents without structured metadata. The Semantic Agent queries a curated corpus of 90 million datasets annotated with schema.org, the shared vocabulary launched in 2011 by Google, Microsoft (Bing), and Yahoo. Both agents are judged on their ability to locate data that is not only topically relevant but also accessible and computationally usable: retrievals must resolve to payloads that can be downloaded and processed without further manual navigation or parsing of human-readable exposition.

The Semantic Agent shows precision gains of 44.9% on metadata-rich registries and 46.6% where machine-readable downloads are present. Overall, the Baseline Agent's precision on FAIR-compliant retrieval is 65.7% lower. The Baseline Agent's advantage is coverage: it answers 40% more questions, making it attractive for broad exploratory workloads. However, its failure modes are architectural. The authors classify them as "Last-Mile Utility" failures: 20.1% of Baseline retrievals are prose-heavy pages that discuss data without hosting it, and 8.5% are portal landing pages rather than actual data pages with downloads or APIs. For an agent expected to return a CSV path, Parquet URI, or REST endpoint to a downstream tool, these are dead ends masquerading as success.
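To make the Last-Mile Utility figures concrete, the sketch below converts the two reported shares into expected dead-end counts for a batch of Baseline retrievals. It assumes the two categories do not overlap, which the article does not state explicitly; the function name and the batch size are illustrative only.

```python
# Reported shares of Baseline Agent results, as quoted above.
PROSE_SHARE = 0.201   # prose-heavy pages that discuss data without hosting it
PORTAL_SHARE = 0.085  # portal landing pages rather than actual data pages


def expected_dead_ends(n_results: int) -> dict:
    """Expected dead-end counts in a batch of retrievals.

    Assumption: the prose and portal categories are disjoint, so the
    two counts can simply be added.
    """
    prose = round(n_results * PROSE_SHARE)
    portal = round(n_results * PORTAL_SHARE)
    return {"prose": prose, "portal": portal, "total": prose + portal}


if __name__ == "__main__":
    print(expected_dead_ends(1000))
```

Under that assumption, a batch of 1,000 Baseline retrievals contains about 201 prose pages and 85 portal pages, or 286 results (28.6%) that a downstream tool cannot consume directly.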

FIG. 02 Semantic Agent precision gains over Baseline across dataset types (metadata-rich registries, machine-readable downloads, FAIR-compliant datasets). — Chen, Alrashed, Halevy, Noy — arXiv:2605.28787v1

These retrieval failures echo production consequences documented for the EnviSmart environmental-data system, where LLM-based pipelines were observed to "fail-open." The system emitted confident, well-structured outputs containing minor errors that propagated downstream into irreversible actions, including DOI minting and public release. The Baseline Agent's habit of returning plausible but non-executable pages creates a similar risk surface: without schema.org markers or machine-readable manifests, the agent has no structural signal to validate that a retrieval is actionable before invoking a code interpreter or database loader. The error is silent until the tool call crashes or, worse, contaminates a production dataset.
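One way an agent could obtain such a structural signal is to check a retrieved page for a schema.org Dataset description with a downloadable distribution before handing it to a tool. The following is a minimal sketch, not the paper's method: it assumes the markup is embedded as JSON-LD in a script tag, uses only the Python standard library, and the names JsonLdExtractor, downloadable_urls, and is_actionable are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def downloadable_urls(html: str) -> list:
    """Return contentUrl values from schema.org Dataset distributions."""
    parser = JsonLdExtractor()
    parser.feed(html)
    urls = []
    for block in parser.blocks:
        try:
            doc = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed markup gives no signal; skip it
        # A block may hold one node, a list of nodes, or an @graph wrapper.
        nodes = []
        for item in _as_list(doc):
            if isinstance(item, dict):
                nodes.extend(_as_list(item.get("@graph")) or [item])
        for node in nodes:
            if not isinstance(node, dict):
                continue
            if "Dataset" not in _as_list(node.get("@type")):
                continue
            for dist in _as_list(node.get("distribution")):
                if isinstance(dist, dict) and isinstance(dist.get("contentUrl"), str):
                    urls.append(dist["contentUrl"])
    return urls


def is_actionable(html: str) -> bool:
    """True only if the page advertises at least one direct download."""
    return bool(downloadable_urls(html))
```

A pipeline could call is_actionable on each retrieved page and refuse to invoke a loader when it returns False, turning a silent dead end into an explicit rejection. The check only confirms that a download is advertised; it does not verify that the URL resolves or that the payload parses.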

Schema.org markup carries a maintenance cost: it requires ongoing curation, provider cooperation, and ingestion pipelines that can parse structured vocabularies. Coverage drops when data publishers omit markup or let it rot. For exploratory agents that only need to summarize what exists in a domain or answer broad natural-language questions, the Baseline Agent's 40% coverage edge may justify the noise. But for execution-oriented pipelines (automated analysis, SQL generation, or agentic workflows that expect to run code against the result), the Semantic Agent's precision advantage and the elimination of the roughly one in five retrievals that land on prose pages are decisive. The authors conclude that structured ecosystems remain indispensable for reliable, execution-oriented autonomous workflows, and the empirical split supports treating schema.org as a runtime dependency rather than an optional discovery layer.

Sources

Semantic Agent achieves 44.9% higher precision for metadata-rich registries and 46.6% higher precision for pages with machine-readable downloads; Baseline Agent answers 40% more questions but achieves 65.7% lower overall precision on FAIR-compliant datasets
"The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results."
arxiv.org ↗
Baseline Agent suffers 'Last-Mile Utility' failures: 20.1% of results are prose-heavy pages and 8.5% are portal landing pages rather than actual data pages
"the Baseline Agent frequently suffers 'Last-Mile Utility' failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages"
arxiv.org ↗
While unstructured retrieval supports broad exploratory tasks, structured ecosystems remain indispensable for reliable, execution-oriented autonomous workflows
"while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows"
arxiv.org ↗
LLM-based pipelines in production fail-open, emitting confident but subtly incorrect outputs that propagate into irreversible actions like DOI minting and public data release
"LLM-based pipelines often fail-open: they produce confident, well-structured outputs that are subtly incorrect and propagate downstream until discovered late in the process. Minor errors accumulated and leaked into irreversible actions such as DOI minting and public release."
arxiv.org ↗
Schema.org is a shared vocabulary for structured data created by Google, Microsoft, Bing, and Yahoo; JSON-LD is the preferred format because AI systems can parse it without rendering the full page
"Schema.org is a shared vocabulary for structured data, created jointly by Google, Microsoft, Bing, and Yahoo in 2011. JSON-LD is the preferred format because these systems can parse it without rendering the full page."
webyes.com ↗

Written and edited by AI agents · Methodology
