Red Hat engineers Legare Kerrison and Cedric Clyburn told practitioners at the Arc of AI 2026 Conference that the industry's most persistent mistake in production LLM deployments is treating public leaderboard scores as a proxy for real-world fitness — and that fixing it requires adopting workload-specific service level objectives (SLOs) anchored to three infrastructure metrics most teams still don't track.

Kerrison and Clyburn mapped the industry's progression year by year: 2023 was the year of base LLMs, 2024 belonged to Retrieval-Augmented Generation (RAG), 2025 to fine-tuning and AI agents, and 2026 is the year of LLM evaluations — the discipline that closes the gap between "the model benchmarks well" and "the model works reliably in production." Most enterprise AI teams have deferred rigorous eval work, and that debt is surfacing as unpredictable latency and quality regressions.

The core structural problem is a "tradeoff triangle" whose three vertices are model quality (accuracy), responsiveness (latency), and cost. Optimizing any two degrades the third. High accuracy plus low latency means high infrastructure cost. Low cost plus high accuracy produces high latency. Low cost plus low latency yields degraded accuracy. Teams that pick a model from a benchmark leaderboard without mapping their own position on that triangle are making an architectural decision without the relevant data — leaderboards use generic criteria like coding, math, and creative writing that do not represent a specific organization's prompts or data distributions.

FIG. 02 The LLM inference tradeoff triangle: optimizing any two vertices degrades the third.
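
To make requirements-first evaluation concrete, here is a minimal Python sketch of how a team might screen candidate deployments against its own workload SLOs rather than a leaderboard rank. The candidate names, scores, and thresholds are illustrative placeholders, not figures from the talk.

```python
from dataclasses import dataclass

# Hypothetical candidate deployments. All numbers are illustrative placeholders,
# not measurements of any real model, hardware, or benchmark.
@dataclass
class Candidate:
    name: str
    task_accuracy: float          # accuracy on the team's own eval set (0-1)
    p99_ttft_ms: float            # measured time to first token, 99th percentile
    cost_per_1k_requests: float   # dollars on the target hardware

def meets_workload_slo(c: Candidate, min_accuracy: float,
                       max_ttft_ms: float, max_cost: float) -> bool:
    """A candidate is viable only if it satisfies all three vertices at once."""
    return (c.task_accuracy >= min_accuracy
            and c.p99_ttft_ms <= max_ttft_ms
            and c.cost_per_1k_requests <= max_cost)

candidates = [
    Candidate("big-leaderboard-winner", 0.92, 450.0, 3.10),
    Candidate("mid-size-tuned",         0.88, 180.0, 1.20),
]

# An e-commerce chatbot SLO prizes responsiveness, so the leaderboard winner fails.
viable = [c.name for c in candidates
          if meets_workload_slo(c, min_accuracy=0.85, max_ttft_ms=200.0, max_cost=2.00)]
print(viable)  # ['mid-size-tuned']
```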

The fix is application-requirements-first evaluation, governed by SLOs with three core metrics. Requests Per Second (RPS) measures throughput and how well the serving stack scales under load. Time to First Token (TTFT) — the interval between sending a request and receiving the first generated token — captures perceived user latency. Inter-Token Latency (ITL) measures the gap between each subsequent token after the first, indicating decoder efficiency and streaming smoothness. Kerrison and Clyburn provided concrete SLO targets by workload type: an e-commerce chatbot requires TTFT at or below 200ms and ITL at or below 50ms at the P99 percentile. A RAG-based application, which consumes more input tokens and produces fewer output tokens, tolerates TTFT up to 300ms, ITL up to 100ms (if streamed), and end-to-end request latency up to 3,000ms, all at P99.

FIG. 03 Sample SLO targets (P99 latency) for e-commerce and RAG applications. — Red Hat, Arc of AI 2026
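
These metrics can be measured directly from any streaming client. The sketch below is a minimal illustration: `fake_stream` is a stand-in for whatever streaming call the serving stack actually exposes, and the thresholds are the e-commerce chatbot targets cited above.

```python
import time
import statistics

def measure_streaming_latency(stream_tokens, prompt: str):
    """Record TTFT and inter-token gaps (ITL samples) for one streamed request.

    `stream_tokens` is a placeholder for whatever streaming call your serving
    stack exposes (for example an OpenAI-compatible client with stream=True);
    it is assumed to yield tokens one at a time.
    """
    start = time.perf_counter()
    ttft, gaps, prev = None, [], start
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start        # time to first token
        else:
            gaps.append(now - prev)   # one inter-token latency sample
        prev = now
    return ttft, gaps

def p99(samples):
    """99th-percentile helper; statistics.quantiles needs at least two samples."""
    return statistics.quantiles(samples, n=100)[98] if len(samples) > 1 else samples[0]

# `fake_stream` stands in for a real model client purely so the sketch runs.
def fake_stream(prompt):
    for token in prompt.split():
        time.sleep(0.01)
        yield token

ttft, gaps = measure_streaming_latency(fake_stream, "hello from the latency demo prompt")
print(f"TTFT={ttft * 1000:.1f}ms  ITL p99={p99(gaps) * 1000:.1f}ms")
# Gate against the e-commerce chatbot SLO from the talk: TTFT <= 200ms, ITL <= 50ms.
assert ttft * 1000 <= 200 and p99(gaps) * 1000 <= 50, "SLO violated"
```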

For AI engineering teams building or auditing infrastructure, the hardware implications follow directly. LLM inference splits into two phases with distinct resource profiles: the Prefill phase, which processes the input prompt, is compute-bound; the Decode phase, which generates each subsequent token, is memory-bound. Confusing the two leads to mismatched hardware procurement. Optimization techniques — speculative decoding, prefix caching, session caching, and structured generation — address specific phases and workload patterns, not all workloads equally. Running inference locally, where the use case allows, eliminates cloud round-trip latency and can shift a workload's position on the triangle.
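
As one example of a phase-specific optimization, prefix caching reduces compute-bound prefill work by reusing the KV cache for a prompt prefix shared across requests, a pattern common in RAG workloads. The following sketch assumes a recent vLLM release whose `LLM` constructor accepts `enable_prefix_caching`, plus a GPU-backed environment; the model name and prompts are placeholders.

```python
# A minimal vLLM sketch (assumed recent release, GPU-backed environment).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",      # small placeholder model, not a recommendation
    enable_prefix_caching=True,     # reuse KV cache for a shared prompt prefix
)

# RAG-style requests that share a long context prefix benefit most: the prefill
# for `shared_context` is computed once and reused, cutting TTFT on later requests.
shared_context = ("You are a support assistant. Retrieved knowledge base passages:\n"
                  "(placeholder passages would be inserted here)\n")
params = SamplingParams(max_tokens=128, temperature=0.2)

for question in ["How do I reset my password?", "What is the refund policy?"]:
    outputs = llm.generate([shared_context + "Q: " + question], params)
    print(outputs[0].outputs[0].text)
```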

The Red Hat team also drew a sharp definitional boundary between model evaluation and model benchmarking that has operational consequences. Model evaluation is the assessment of a specific model's performance and suitability on a target workload running on target hardware. Model benchmarking is standardized comparison against predefined datasets across models. Conflating the two — running a benchmark and calling it an evaluation — is the mechanism by which teams ship models that score well publicly but underperform in production. The implication for CI/CD pipelines is that benchmark runs belong in selection gates, while task-specific evaluation suites belong in regression checks tied to each deployment.
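
A regression check of that kind can be a small script that replays the team's own eval suite against a candidate deployment and fails the pipeline when quality or latency falls outside the SLO. The sketch below uses a stubbed client and illustrative cases; in a real pipeline the stub would be replaced with a client for the candidate endpoint and the cases with the team's task-specific suite.

```python
import sys
import time

# Workload-level SLOs for this deployment's regression gate. The threshold values
# follow the e-commerce example above; the eval cases and client are illustrative stubs.
SLO = {"min_task_accuracy": 0.85, "max_p99_ttft_ms": 200.0}

EVAL_SUITE = [  # the team's own task-specific cases, not a public benchmark
    {"prompt": "Where is my order?", "expected": "order"},
    {"prompt": "Cancel my subscription", "expected": "cancel"},
]

class StubClient:
    """Stand-in for the candidate deployment; replace with a real endpoint client."""
    def complete(self, prompt):
        time.sleep(0.05)                                    # simulated ~50 ms TTFT
        return f"Sure, I can help: {prompt.lower()}", 0.05  # (answer, ttft_seconds)

def regression_gate(client) -> bool:
    ttfts, correct = [], 0
    for case in EVAL_SUITE:
        answer, ttft_s = client.complete(case["prompt"])
        correct += case["expected"] in answer.lower()
        ttfts.append(ttft_s * 1000)
    accuracy = correct / len(EVAL_SUITE)
    p99_ttft = max(ttfts)  # with few samples, take the max as a conservative P99
    ok = accuracy >= SLO["min_task_accuracy"] and p99_ttft <= SLO["max_p99_ttft_ms"]
    print(f"{'PASS' if ok else 'FAIL'}: accuracy={accuracy:.2f}, P99 TTFT={p99_ttft:.0f}ms")
    return ok

if __name__ == "__main__":
    sys.exit(0 if regression_gate(StubClient()) else 1)  # non-zero exit blocks the deploy
```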

Enterprise AI teams that have not yet defined SLOs at the workload level are operating with no reliable signal on whether a new model version, serving-engine upgrade, or hardware configuration change is an improvement or a regression. Kerrison and Clyburn's framework requires no rearchitecting of existing pipelines — it requires instrumenting them with the three metrics that actually govern user experience and cost. Teams that instrument first will be positioned to make the hardware and model-provider decisions that a field-wide shift to evaluation rigor will force.
