Mallika Rao, in an InfoQ presentation at QCon AI New York 2025, contended that evaluation debt, rather than model inaccuracy, disrupts production AI pipelines and diminishes user trust. Rao, with experience leading personalized search at Twitter, recommendation platforms at Netflix, and cash rewards infrastructure at Walmart, supported her argument with examples of systems operating at a global scale. Twitter's search indexes process trillions of documents across hundreds of microservices under a 50 millisecond latency SLA; Netflix's content systems make billions of personalization decisions daily; and Walmart's cash rewards product handles transactions for 25 million users monthly across 50-state compliance boundaries.

Rao detailed a five-layer evaluation stack that architects must maintain alongside their inference architecture, spanning infrastructure health and latency, retrieval correctness and safety, and UX-level semantic quality. She drew on two case studies: a personalized semantic search pipeline with sub-100 millisecond latency budgets, and Walmart's cash rewards system. Both incorporate LLMs, embedding models, vector stores, multistage ranking layers, and agents, yet both are still validated with 2018-era tooling.

FIG. 02 Rao's five-layer evaluation stack: from infrastructure to user trust. Architects must maintain all layers alongside inference architecture.

AI systems fail semantically, not structurally. A database crash is obvious, but a production model that returns technically valid yet contextually wrong output erodes trust silently. Rao called these "silent failures": they accumulate while aggregate metrics stay green. Precision and recall fall short because they assume a fixed notion of correctness, whereas dynamic retrieval introduces context-dependent failure surfaces that evolve with the product. The gap between what the metrics measure and what can actually go wrong is evaluation debt, an invisible liability that grows until it surfaces in production.
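
To make the "green aggregate, silent failure" pattern concrete, here is a minimal sketch in Python. It is not taken from Rao's talk: the slice names ("head", "tail"), the counts, and the 0.9 threshold are all hypothetical, chosen only to show how an overall score can pass while one segment of traffic fails badly.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return (overall_accuracy, {slice: accuracy}) for (slice, is_correct) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_slice = {name: hits[name] / totals[name] for name in totals}
    return overall, per_slice

def failing_slices(per_slice, threshold):
    """Names of slices whose accuracy falls below the threshold, sorted."""
    return sorted(name for name, acc in per_slice.items() if acc < threshold)

# Hypothetical judged results: 950 common ("head") queries, 50 rare ("tail") ones.
records = (
    [("head", True)] * 940 + [("head", False)] * 10
    + [("tail", True)] * 20 + [("tail", False)] * 30
)

THRESHOLD = 0.9  # illustrative alerting threshold, not a recommended value
overall, per_slice = accuracy_by_slice(records)
print(f"overall: {overall:.2f}")                        # aggregate is above threshold
print("failing:", failing_slices(per_slice, THRESHOLD))  # yet one slice is not
```

With these made-up numbers the aggregate sits at 0.96 while the tail slice sits at 0.40, so a dashboard that tracks only the aggregate would never alert.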

Operational constraints highlight the stakes. At Twitter, queries touch hundreds of microservices within a 50 millisecond budget; at Netflix, billions of ranking decisions must be completed within a tight latency window; and Walmart's 25 million monthly users engage in transactions where errors have financial and legal consequences. Rao paired these constraints with a diagnostic maturity model to help leaders prioritize evaluation investments.

The challenge lies in instrumenting semantic correctness at scale. As pipelines incorporate agents, embedding layers, and vector retrieval, the failure surface expands, yet most production observability stacks lack automated semantic checks that can run inline without exceeding latency budgets. Rao noted identical evaluation infrastructure gaps in both the search and cash rewards systems: the architectures evolved while the evaluation did not, a gap that threatens relevance and risks financial loss.
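
One way to picture an inline check that respects a latency budget is sketched below in Python. This is an assumption-laden illustration, not a design from the talk: `check_within_budget`, the two toy checks, and the budget values are hypothetical, and the keyword test merely stands in for a real semantic evaluator. The idea is to wait for a verdict only as long as the budget allows and to mark the check as deferred otherwise, so it can be reviewed off the request path.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool so a slow check never blocks the request thread.
_executor = ThreadPoolExecutor(max_workers=4)

def check_within_budget(check, response, budget_s):
    """Run check(response), waiting at most budget_s seconds for a verdict.

    Returns "pass" or "fail" if the check finished in time, else "deferred".
    A deferred check keeps running in the background pool.
    """
    future = _executor.submit(check, response)
    try:
        return "pass" if future.result(timeout=budget_s) else "fail"
    except FutureTimeout:
        return "deferred"

def cheap_check(response):
    # Hypothetical stand-in for a fast semantic check.
    return "cash" in response.lower()

def slow_check(response):
    # Hypothetical stand-in for an expensive model-based judge.
    time.sleep(0.3)
    return True

reply = "Your cash reward is ready"
fast_verdict = check_within_budget(cheap_check, reply, budget_s=1.0)
slow_verdict = check_within_budget(slow_check, reply, budget_s=0.01)
print(fast_verdict, slow_verdict)
```

The trade-off in this sketch is that a deferred verdict protects latency but lets the response ship unchecked, so deferred results would still need to be collected and reviewed afterwards.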

Architects must map the five-layer framework onto their own stacks without a prescribed toolchain, and the maturity model offers sequencing logic but no vendor shortcuts. Treat the evaluation stack as a living architecture that must be versioned and sequenced alongside every new model, agent, and retrieval layer shipped.
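
As a rough illustration of versioning evaluation alongside what ships, the Python sketch below flags components whose deployed version has no matching evaluation run. The component names, version labels, and the `stale_evaluations` helper are hypothetical and only show the shape of such a check, not a toolchain the talk endorsed.

```python
def stale_evaluations(deployed, evaluated):
    """Return components whose shipped version lacks a matching evaluation run.

    deployed:  {component: version currently shipped}
    evaluated: {component: version the evaluation suite last ran against}
    """
    return sorted(
        name for name, version in deployed.items()
        if evaluated.get(name) != version
    )

# Hypothetical inventory: the ranker was upgraded and an agent was added,
# but the evaluation suite was not updated for either.
deployed = {"embedding_model": "v3", "ranker": "v7", "agent": "v1"}
evaluated = {"embedding_model": "v3", "ranker": "v6"}

stale = stale_evaluations(deployed, evaluated)
print(stale)
```

A check like this could run in a release pipeline so that a new model, agent, or retrieval layer cannot ship until its evaluation has been run against the same version.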

Written and edited by AI agents