Researchers have published Strait, an ML inference serving system that cuts deadline violations for high-priority GPU workloads by 1.02 to 11.18 percentage points under intense load while keeping acceptable latency on low-priority tasks.
When multiple models compete for the same GPU, latency estimates break down and service-level objectives degrade. Existing schedulers either ignore task priority or rely on software preemption, which sacrifices throughput and accrues overhead at kernel boundaries; hardware-level preemption compounds the problem with its own context-switch cost. Neither approach holds up when GPU utilization is high and deadlines are tight.
Strait rests on two components. First, it predicts kernel execution interference: the measurable slowdown when two DNN workloads share a GPU's streaming multiprocessors. Second, it models contention on the data-transfer path. Queuing delays there are hidden from the compute scheduler but govern end-to-end latency. These estimates let the scheduler make priority-aware decisions before a deadline is missed.
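The paper's exact estimator isn't reproduced in the article, but the shape of the idea can be sketched: inflate a request's profiled isolated runtime by a measured pairwise interference factor, then add the queuing delay on the copy path. Everything below (the `INTERFERENCE` table, field names, the PCIe bandwidth figure) is an illustrative assumption, not Strait's actual model.

```python
from dataclasses import dataclass

@dataclass
class Request:
    model: str
    isolated_ms: float   # profiled compute time with the GPU to itself
    input_bytes: int
    deadline_ms: float

# Hypothetical pairwise slowdown table: factor by which model A's kernels
# stretch when co-located with model B on the same streaming multiprocessors.
INTERFERENCE = {("resnet50", "bert-base"): 1.35,
                ("bert-base", "resnet50"): 1.22}

def predicted_latency_ms(req: Request, running: list[str],
                         pcie_gb_per_s: float = 12.0,
                         queued_bytes: int = 0) -> float:
    """Isolated time inflated by the worst co-location slowdown, plus
    the time to drain the copy queue ahead of this request's input."""
    slowdown = max((INTERFERENCE.get((req.model, other), 1.0)
                    for other in running), default=1.0)
    # Transfer delay: bytes already queued on the PCIe path plus our own.
    transfer_ms = (queued_bytes + req.input_bytes) / (pcie_gb_per_s * 1e9) * 1e3
    return req.isolated_ms * slowdown + transfer_ms
```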
The scheduling policy operates on dual-priority traffic: latency-sensitive requests (interactive, user-facing) and best-effort requests (batch scoring, background refresh). By predicting how much a new request will interfere with already-running jobs, Strait can defer or reschedule low-priority work to protect the deadline budget of high-priority requests. Under intense load, this cuts high-priority deadline violations at an acceptable cost to low-priority completion times.
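Building on that estimate, the defer-or-admit decision the article describes might look like the following, reusing `Request` and `predicted_latency_ms` from the sketch above. The policy details here (strict high-priority admission, a simple defer queue) are assumptions, not Strait's published algorithm.

```python
def admit(req: Request, priority: str, running_high: list[Request],
          running_models: list[str], defer_queue: list[Request]) -> bool:
    """Admit a request only if it won't push any running high-priority
    job past its deadline budget; otherwise defer the best-effort work."""
    if priority == "high":
        return True  # high-priority is always admitted; low-priority yields
    for hp in running_high:
        # Re-estimate the high-priority job's latency if req joins the GPU.
        with_req = predicted_latency_ms(hp, running_models + [req.model])
        if with_req > hp.deadline_ms:
            defer_queue.append(req)  # retry once the GPU drains
            return False
    return True
```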
GPU cluster utilization and SLA compliance have long been in tension because most schedulers optimize for throughput, not deadline semantics. Strait instead treats GPU time as a real-time resource. On-premises inference environments, where a fixed pool of GPUs is shared across multiple models, stand to benefit most: adding capacity there means procurement cycles, not elastic cloud scaling.
Compared to software-defined preemption, Strait delivers more equitable performance across workloads. Because it avoids hard interruptions, it sidesteps the variable overhead that mixed workloads incur: transformer inference running alongside CNN-based vision pipelines often sees highly variable preemption cost, degrading whichever workload happens to be mid-kernel when preemption fires.
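Why that overhead varies so much is easy to see with a toy model: kernel-boundary preemption must wait for the in-flight kernel to finish, so the delay is that kernel's residual time. The kernel durations below are invented for illustration; a few long attention-style kernels mixed with many short convolution kernels yields a heavy-tailed delay distribution.

```python
import random
import statistics

# Made-up kernel durations (ms): long attention kernels mixed with
# short convolution kernels, as in the transformer + CNN scenario above.
KERNELS_MS = [8.0, 7.5, 6.0] * 3 + [0.4, 0.6, 0.5] * 20

def preemption_delay(kernels: list[float]) -> float:
    """Delay until the next kernel boundary when preemption fires at a
    uniformly random instant during the running kernel sequence."""
    t = random.uniform(0, sum(kernels))
    for k in kernels:
        if t < k:
            return k - t  # residual time of the in-flight kernel
        t -= k
    return 0.0

delays = [preemption_delay(KERNELS_MS) for _ in range(10_000)]
print(f"mean {statistics.mean(delays):.2f} ms, "
      f"stdev {statistics.stdev(delays):.2f} ms, "
      f"max {max(delays):.2f} ms")
```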
The evaluation covers dual-priority traffic only. Multi-tier SLA environments, common at large enterprises with three or more service classes, remain untested. The adaptive prediction model's accuracy across diverse architectures (attention-heavy LLMs, convolution-heavy vision models, sparse MoE variants) is also not fully characterized. Strait is a research prototype: there is no production deployment data and no described integration path with existing serving frameworks such as Triton Inference Server or vLLM.
The deadline-awareness gap in GPU schedulers is well documented in production MLOps. Strait's interference modeling gives infrastructure teams a concrete algorithmic target to evaluate against their own SLA violation rates.
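Measuring that baseline is straightforward. A minimal sketch, assuming a request log in CSV form with `priority` and `latency_ms` columns (hypothetical field names):

```python
import csv

def violation_rate(log_path: str, deadline_ms: float) -> float:
    """Fraction of high-priority requests whose measured latency
    exceeded the deadline budget."""
    total = violations = 0
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["priority"] != "high":
                continue
            total += 1
            if float(row["latency_ms"]) > deadline_ms:
                violations += 1
    return violations / total if total else 0.0
```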