Red Hat engineers Legare Kerrison and Cedric Clyburn told practitioners at the Arc of AI 2026 Conference that the industry's most persistent mistake in production LLM deployments is treating public leaderboard scores as a proxy for real-world fitness — and that fixing it requires adopting workload-specific service level objectives (SLOs) anchored to three infrastructure metrics most teams still don't track.

Kerrison and Clyburn mapped the industry's progression year by year: 2023 was the year of base LLMs, 2024 belonged to Retrieval-Augmented Generation (RAG), 2025 to fine-tuning and AI agents, and 2026 is the year of LLM evaluations — the discipline that closes the gap between "the model benchmarks well" and "the model works reliably in production." Most enterprise AI teams have deferred rigorous eval work, and that debt is surfacing as unpredictable latency and quality regressions.

The core structural problem is a "tradeoff triangle" whose three vertices are model quality (accuracy), responsiveness (latency), and cost. Optimizing any two degrades the third. High accuracy plus low latency means high infrastructure cost. Low cost plus high accuracy produces high latency. Low cost plus low latency yields degraded accuracy. Teams that pick a model from a benchmark leaderboard without mapping their own position on that triangle are making an architectural decision without the relevant data — leaderboards use generic criteria like coding, math, and creative writing that do not represent a specific organization's prompts or data distributions.

FIG. 02 The LLM inference tradeoff triangle: optimizing any two vertices degrades the third.
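
To make requirements-first evaluation concrete, here is a minimal Python sketch of how a team might screen candidate deployments against its own workload SLOs rather than a leaderboard rank. The candidate names, scores, and thresholds are illustrative placeholders, not figures from the talk.

```python
from dataclasses import dataclass

# Hypothetical candidate deployments. All numbers are illustrative placeholders,
# not measurements of any real model, hardware, or benchmark.
@dataclass
class Candidate:
    name: str
    task_accuracy: float          # accuracy on the team's own eval set (0-1)
    p99_ttft_ms: float            # measured time to first token, 99th percentile
    cost_per_1k_requests: float   # dollars on the target hardware

def meets_workload_slo(c: Candidate, min_accuracy: float,
                       max_ttft_ms: float, max_cost: float) -> bool:
    """A candidate is viable only if it satisfies all three vertices at once."""
    return (c.task_accuracy >= min_accuracy
            and c.p99_ttft_ms <= max_ttft_ms
            and c.cost_per_1k_requests <= max_cost)

candidates = [
    Candidate("big-leaderboard-winner", 0.92, 450.0, 3.10),
    Candidate("mid-size-tuned",         0.88, 180.0, 1.20),
]

# An e-commerce chatbot SLO prizes responsiveness, so the leaderboard winner fails.
viable = [c.name for c in candidates
          if meets_workload_slo(c, min_accuracy=0.85, max_ttft_ms=200.0, max_cost=2.00)]
print(viable)  # ['mid-size-tuned']
```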

The fix is application-requirements-first evaluation, governed by SLOs with three core metrics. Requests Per Second (RPS) measures throughput and how well the serving stack scales under load. Time to First Token (TTFT) — the interval between sending a request and receiving the first generated token — captures perceived user latency. Inter-Token Latency (ITL) measures the gap between each subsequent token after the first, indicating decoder efficiency and streaming smoothness. Kerrison and Clyburn provided concrete SLO targets by workload type: an e-commerce chatbot requires TTFT at or below 200ms and ITL at or below 50ms at the P99 percentile. A RAG-based application, which consumes more input tokens and produces fewer output tokens, tolerates TTFT up to 300ms, ITL up to 100ms (if streamed), and end-to-end request latency up to 3,000ms, all at P99.

FIG. 03 Sample SLO targets (P99 latency) for e-commerce and RAG applications. — Red Hat, Arc of AI 2026
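
These metrics can be measured directly from any streaming client. The sketch below is a minimal illustration: `fake_stream` is a stand-in for whatever streaming call the serving stack actually exposes, and the thresholds are the e-commerce chatbot targets cited above.

```python
import time
import statistics

def measure_streaming_latency(stream_tokens, prompt: str):
    """Record TTFT and inter-token gaps (ITL samples) for one streamed request.

    `stream_tokens` is a placeholder for whatever streaming call your serving
    stack exposes (for example an OpenAI-compatible client with stream=True);
    it is assumed to yield tokens one at a time.
    """
    start = time.perf_counter()
    ttft, gaps, prev = None, [], start
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start        # time to first token
        else:
            gaps.append(now - prev)   # one inter-token latency sample
        prev = now
    return ttft, gaps

def p99(samples):
    """99th-percentile helper; statistics.quantiles needs at least two samples."""
    return statistics.quantiles(samples, n=100)[98] if len(samples) > 1 else samples[0]

# `fake_stream` stands in for a real model client purely so the sketch runs.
def fake_stream(prompt):
    for token in prompt.split():
        time.sleep(0.01)
        yield token

ttft, gaps = measure_streaming_latency(fake_stream, "hello from the latency demo prompt")
print(f"TTFT={ttft * 1000:.1f}ms  ITL p99={p99(gaps) * 1000:.1f}ms")
# Gate against the e-commerce chatbot SLO from the talk: TTFT <= 200ms, ITL <= 50ms.
assert ttft * 1000 <= 200 and p99(gaps) * 1000 <= 50, "SLO violated"
```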

For AI engineering teams building or auditing infrastructure, the hardware implications follow directly. LLM inference splits into two phases with distinct resource profiles: the Prefill phase, which processes the input prompt, is compute-bound; the Decode phase, which generates each subsequent token, is memory-bound. Confusing the two leads to mismatched hardware procurement. Optimization techniques — speculative decoding, prefix caching, session caching, and structured generation — address specific phases and workload patterns, not all workloads equally. Running inference locally, where the use case allows, eliminates cloud round-trip latency and can shift a workload's position on the triangle.
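
As one example of a phase-specific optimization, prefix caching reduces compute-bound prefill work by reusing the KV cache for a prompt prefix shared across requests, a pattern common in RAG workloads. The following sketch assumes a recent vLLM release whose `LLM` constructor accepts `enable_prefix_caching`, plus a GPU-backed environment; the model name and prompts are placeholders.

```python
# A minimal vLLM sketch (assumed recent release, GPU-backed environment).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",      # small placeholder model, not a recommendation
    enable_prefix_caching=True,     # reuse KV cache for a shared prompt prefix
)

# RAG-style requests that share a long context prefix benefit most: the prefill
# for `shared_context` is computed once and reused, cutting TTFT on later requests.
shared_context = ("You are a support assistant. Retrieved knowledge base passages:\n"
                  "(placeholder passages would be inserted here)\n")
params = SamplingParams(max_tokens=128, temperature=0.2)

for question in ["How do I reset my password?", "What is the refund policy?"]:
    outputs = llm.generate([shared_context + "Q: " + question], params)
    print(outputs[0].outputs[0].text)
```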

The Red Hat team also drew a sharp definitional boundary between model evaluation and model benchmarking that has operational consequences. Model evaluation is the assessment of a specific model's performance and suitability on a target workload running on target hardware. Model benchmarking is standardized comparison against predefined datasets across models. Conflating the two — running a benchmark and calling it an evaluation — is the mechanism by which teams ship models that score well publicly but underperform in production. The implication for CI/CD pipelines is that benchmark runs belong in selection gates, while task-specific evaluation suites belong in regression checks tied to each deployment.
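
A regression check of that kind can be a small script that replays the team's own eval suite against a candidate deployment and fails the pipeline when quality or latency falls outside the SLO. The sketch below uses a stubbed client and illustrative cases; in a real pipeline the stub would be replaced with a client for the candidate endpoint and the cases with the team's task-specific suite.

```python
import sys
import time

# Workload-level SLOs for this deployment's regression gate. The threshold values
# follow the e-commerce example above; the eval cases and client are illustrative stubs.
SLO = {"min_task_accuracy": 0.85, "max_p99_ttft_ms": 200.0}

EVAL_SUITE = [  # the team's own task-specific cases, not a public benchmark
    {"prompt": "Where is my order?", "expected": "order"},
    {"prompt": "Cancel my subscription", "expected": "cancel"},
]

class StubClient:
    """Stand-in for the candidate deployment; replace with a real endpoint client."""
    def complete(self, prompt):
        time.sleep(0.05)                                    # simulated ~50 ms TTFT
        return f"Sure, I can help: {prompt.lower()}", 0.05  # (answer, ttft_seconds)

def regression_gate(client) -> bool:
    ttfts, correct = [], 0
    for case in EVAL_SUITE:
        answer, ttft_s = client.complete(case["prompt"])
        correct += case["expected"] in answer.lower()
        ttfts.append(ttft_s * 1000)
    accuracy = correct / len(EVAL_SUITE)
    p99_ttft = max(ttfts)  # with few samples, take the max as a conservative P99
    ok = accuracy >= SLO["min_task_accuracy"] and p99_ttft <= SLO["max_p99_ttft_ms"]
    print(f"{'PASS' if ok else 'FAIL'}: accuracy={accuracy:.2f}, P99 TTFT={p99_ttft:.0f}ms")
    return ok

if __name__ == "__main__":
    sys.exit(0 if regression_gate(StubClient()) else 1)  # non-zero exit blocks the deploy
```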

Enterprise AI teams that have not yet defined SLOs at the workload level are operating with no reliable signal on whether a new model version, serving-engine upgrade, or hardware configuration change is an improvement or a regression. Kerrison and Clyburn's framework requires no rearchitecting of existing pipelines — it requires instrumenting them with the three metrics that actually govern user experience and cost. Teams that instrument first will be positioned to make the hardware and model-provider decisions that a field-wide shift to evaluation rigor will force.
