Silent Failures Accrue: Why Evaluation Debt Matters

In an InfoQ presentation at QCon AI New York 2025, Mallika Rao argued that evaluation debt, rather than model inaccuracy, is what breaks production AI pipelines and erodes user trust. Rao has led search infrastructure at Twitter, content systems at Netflix, and cash rewards infrastructure at Walmart, and she grounded her argument in systems that operate at global scale. Twitter's search indexes trillions of documents and serves each query across hundreds of microservices under a sub-50 millisecond latency budget; Netflix's content systems make billions of personalization decisions every day; and Walmart's cash rewards product handles dollar-denominated transactions for 25 million users a month, subject to compliance requirements in all 50 states.

Rao detailed a five-layer evaluation stack that architects must maintain alongside their inference architecture. It covers concerns that range from infrastructure health and latency, through retrieval correctness and safety, to UX-level semantic quality. She illustrated it with two case studies: a personalized semantic search pipeline with sub-100 millisecond latency budgets, and Walmart's cash rewards system. Both incorporate LLMs, embedding models, vector stores, multistage ranking layers, and agents, yet both are still validated with evaluation tooling that Rao described as stuck in 2018.

FIG. 02 Rao's five-layer evaluation stack: from infrastructure to user trust. Architects must maintain all layers alongside inference architecture.

AI systems fail semantically, not structurally. A database crash is obvious, but a production model that returns technically valid yet contextually wrong output erodes trust without raising any alarm. Rao called these "silent failures": they accumulate while the dashboards and aggregate metrics stay green. Precision and recall fall short here because they assume a fixed notion of correctness, whereas dynamic retrieval introduces context-dependent failure surfaces that shift as the product evolves. The resulting gap between what the metrics measure and what can actually go wrong is a symptom of evaluation debt, which Rao defined as what happens when system architectures grow more sophisticated but the evaluation infrastructure does not. In her words, it "accumulates silently and explodes spectacularly."
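The way an aggregate metric can hide a silent failure is easy to show with a small sketch. The example below is illustrative only and is not taken from Rao's talk: the segment names, the judged results, and the 0.90 alert threshold are all hypothetical. It computes precision once over all traffic and once per query segment.

```python
from collections import defaultdict

# Hypothetical relevance judgments: (query_segment, result_was_right_for_user).
# The segment names and counts are invented for illustration.
judgments = (
    [("head_queries", True)] * 940
    + [("head_queries", False)] * 10
    + [("new_market_queries", True)] * 15
    + [("new_market_queries", False)] * 35
)

THRESHOLD = 0.90  # assumed alerting threshold, not a figure from the talk


def precision(rows):
    """Fraction of returned results judged relevant."""
    return sum(1 for _, ok in rows if ok) / len(rows)


def precision_by_segment(rows):
    """Same metric, sliced so that small cohorts are not averaged away."""
    groups = defaultdict(list)
    for segment, ok in rows:
        groups[segment].append((segment, ok))
    return {segment: precision(group) for segment, group in groups.items()}


overall = precision(judgments)
per_segment = precision_by_segment(judgments)
failing = [name for name, value in per_segment.items() if value < THRESHOLD]

print(f"overall precision: {overall:.3f}")  # the dashboard number
for name, value in per_segment.items():
    print(f"{name}: {value:.3f}")
print("segments below threshold:", failing)
```

With this made-up data the overall figure clears the threshold while one small segment sits far below it, which is the pattern of a green dashboard alongside unhappy users. Slicing by segment is only one possible mitigation; it still assumes the relevance judgments themselves reflect the user's context.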

The operational constraints show what is at stake. At Twitter, each query touches hundreds of microservices within a sub-50 millisecond budget; at Netflix, content systems make billions of personalization decisions every day; and at Walmart, 25 million monthly users make dollar-denominated transactions in which errors carry financial and legal consequences. Rao paired these constraints with a diagnostic maturity model to help engineering leaders prioritize their evaluation investments.

The challenge lies in instrumenting semantic correctness at scale. As pipelines add agents, embedding layers, and vector retrieval, the failure surface expands, yet most production observability stacks lack automated semantic checks that can run inline without exceeding latency budgets. Rao found the same evaluation infrastructure gaps in both the search and the cash rewards systems: the architectures had evolved while the evaluation had not. In search, the cost is degraded relevance; in cash rewards, it is financial loss.
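One way to think about running a semantic check inline without exceeding a latency budget is to give the check a hard time limit and defer it to offline evaluation when it overruns. The following is a minimal sketch under stated assumptions, not a description of the systems Rao discussed: `run_inline_check`, `fast_check`, `slow_check`, the payload fields, and the 50 millisecond budget are all hypothetical names and values chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)
deferred = []  # stand-in for an offline evaluation queue


def run_inline_check(check, payload, budget_s):
    """Run check(payload) without blocking the caller longer than budget_s.

    Returns "pass" or "fail" if the check finishes in time, else "deferred".
    A timed-out check keeps running in its worker thread; only the request
    path stops waiting for it.
    """
    future = _pool.submit(check, payload)
    try:
        return "pass" if future.result(timeout=budget_s) else "fail"
    except FutureTimeout:
        deferred.append(payload)  # re-evaluate later, off the request path
        return "deferred"


def fast_check(payload):
    # Hypothetical cheap check: the result must mention a required term.
    return payload["required_term"] in payload["result_text"]


def slow_check(payload):
    time.sleep(0.2)  # simulates an expensive model-based judge
    return True


request = {"required_term": "refund", "result_text": "refund policy for orders"}
r_fast = run_inline_check(fast_check, request, budget_s=0.05)
r_slow = run_inline_check(slow_check, request, budget_s=0.05)
print(r_fast, r_slow, len(deferred))
```

The design choice here is to treat the latency budget as non-negotiable and the check as best-effort: cheap checks gate the response, while expensive ones degrade to sampled, offline evaluation instead of being dropped.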

The five-layer framework prescribes no toolchain, so architects must map it onto their own stacks; the maturity model offers sequencing logic but no vendor shortcuts. The practical takeaway is to treat the evaluation stack as a living architecture that is versioned and updated alongside every new model, agent, and retrieval layer that ships.
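Versioning evaluation alongside the pipeline can be approximated with a simple release check that compares deployed component versions against the versions each evaluation suite was last validated for. The sketch below is a hypothetical illustration, not a tool from the talk: the component names, suite names, registry layout, and `find_evaluation_gaps` function are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    version: str


# Hypothetical deployed pipeline components.
deployed = [
    Component("embedding-model", "v7"),
    Component("vector-index", "v3"),
    Component("ranking-agent", "v2"),
]

# Hypothetical registry: the component versions each suite was validated against.
eval_suites = {
    "embedding-model": {
        "suite": "embedding-drift-checks",
        "validated_versions": {"v6", "v7"},
    },
    "vector-index": {
        "suite": "retrieval-correctness",
        "validated_versions": {"v2"},
    },
}


def find_evaluation_gaps(components, suites):
    """List components whose evaluation has not kept pace with the deployment."""
    gaps = []
    for component in components:
        suite = suites.get(component.name)
        if suite is None:
            gaps.append((component.name, component.version, "no eval suite"))
        elif component.version not in suite["validated_versions"]:
            gaps.append(
                (component.name, component.version, "suite not validated for version")
            )
    return gaps


gaps = find_evaluation_gaps(deployed, eval_suites)
for name, version, reason in gaps:
    print(f"{name} {version}: {reason}")
```

A check like this could run in a release pipeline so that a new model or agent cannot ship while its evaluation coverage lags behind, which is one concrete reading of evaluation debt.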

Sources

Evaluation debt—not model inaccuracy—is what breaks production AI pipelines and erodes user trust
"Very rarely do the models actually come in the way of shipping products that thrive. It's actually your evaluation frameworks that can break your products, break your pipelines, and actually touch that user trust."
infoq.com ↗
Twitter's search indexes trillions of documents and serves queries across hundreds of microservices under a sub-50 millisecond latency SLA
"I have led search infrastructure teams at Twitter, trillions of documents, sub-50 millisecond latency budgets at global scale... every query touches hundreds of microservices internally."
infoq.com ↗
Netflix's content systems make billions of personalization decisions daily
"Most recently, the content systems at Netflix, where we process billions of personalization decisions every day for global scale."
infoq.com ↗
Walmart's cash rewards product processes dollar-denominated transactions for 25 million users every month across 50-state compliance boundaries
"Cash rewards for, let's say, 25 million users every month, dollar denominated transactions, zero scope for error... compliance requirements across 50 states."
infoq.com ↗
Evaluation debt is defined as the gap when system architectures evolve but evaluation infrastructure stays stuck
"It's what happens when your system architectures have evolved, gotten more sophisticated, but your evaluation infrastructure doesn't. It's stuck in 2018."
infoq.com ↗
AI systems return results that are technically correct but completely wrong for the user — dashboards stay green while user trust erodes (silent failures)
"They fail semantically. They return results that are technically correct, but completely wrong for the user. Your dashboards are green, your metrics look good, but something's not ok with how your users are responding to your products."
infoq.com ↗
Evaluation debt accumulates silently and explodes spectacularly when it surfaces
"It accumulates silently and explodes spectacularly."
infoq.com ↗
Rao presented a five-layer evaluation stack spanning infrastructure and UX, and a diagnostic maturity model for engineering leaders
"She explains why traditional metrics fail modern architectures, breaks down a five-layer evaluation stack spanning infrastructure and UX, and shares a diagnostic maturity model to help engineering leaders eliminate silent semantic failures."
infoq.com ↗
Both the search and cash rewards systems showed the same root cause despite radically different stakes
"Very different systems, very different architectures, very different engineering challenges, and very different business stakes, but the same error pattern, same infrastructure gaps, and the same root cause, the way I see it, evaluation debt."
infoq.com ↗
As organizations adopt AI at scale, evaluation becomes the backbone of trust, safety, and product readiness
"As organizations adopt AI at scale, evaluation becomes the backbone of trust, safety, and product readiness."
ai.qconferences.com ↗
QCon AI New York 2025 focused on moving AI from PoC to production; Rao's talk addressed identifying risks, biases, and vulnerabilities through rigorous evaluation
"To secure an AI system, you must be able to evaluate its behavior and performance rigorously... identifying potential risks, biases, and vulnerabilities before they can be exploited or cause harm."
infoq.com ↗

Written and edited by AI agents · Methodology
