Researchers have published Strait, an ML inference serving system that cuts deadline violations for high-priority GPU workloads by 1.02 to 11.18 percentage points under intense load while keeping acceptable latency on low-priority tasks.
When multiple models compete for the same GPU, latency estimates break down and service-level objectives degrade. Existing schedulers either ignore task priority or rely on software preemption, which sacrifices throughput and accrues overhead at kernel boundaries; hardware-level preemption compounds the problem with its own context-switch cost. Neither approach holds up when GPU utilization is high and deadlines are tight.
Strait rests on two components. First, it predicts kernel execution interference: the measurable slowdown when two DNN workloads share a GPU's streaming multiprocessors. Second, it models contention on the data-transfer path. Queuing delays there are hidden from the compute scheduler but govern end-to-end latency. These estimates let the scheduler make priority-aware decisions before a deadline is missed.
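The paper's exact estimator isn't reproduced in the article, but the shape of the idea can be sketched: inflate a request's profiled isolated runtime by a measured pairwise interference factor, then add the queuing delay on the copy path. Everything below (the `INTERFERENCE` table, field names, the PCIe bandwidth figure) is an illustrative assumption, not Strait's actual model.

```python
from dataclasses import dataclass

@dataclass
class Request:
    model: str
    isolated_ms: float   # profiled compute time with the GPU to itself
    input_bytes: int
    deadline_ms: float

# Hypothetical pairwise slowdown table: factor by which model A's kernels
# stretch when co-located with model B on the same streaming multiprocessors.
INTERFERENCE = {("resnet50", "bert-base"): 1.35,
                ("bert-base", "resnet50"): 1.22}

def predicted_latency_ms(req: Request, running: list[str],
                         pcie_gb_per_s: float = 12.0,
                         queued_bytes: int = 0) -> float:
    """Isolated time inflated by the worst co-location slowdown, plus
    the time to drain the copy queue ahead of this request's input."""
    slowdown = max((INTERFERENCE.get((req.model, other), 1.0)
                    for other in running), default=1.0)
    # Transfer delay: bytes already queued on the PCIe path plus our own.
    transfer_ms = (queued_bytes + req.input_bytes) / (pcie_gb_per_s * 1e9) * 1e3
    return req.isolated_ms * slowdown + transfer_ms
```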
The scheduling policy operates on dual-priority traffic: latency-sensitive requests (interactive, user-facing) and best-effort requests (batch scoring, background refresh). By predicting how much a new request will interfere with already-running jobs, Strait can defer or reschedule low-priority work to protect the deadline budget of high-priority requests. Under intense load, this cuts high-priority deadline violations at an acceptable cost to low-priority completion times.
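Building on that estimate, the defer-or-admit decision the article describes might look like the following, reusing `Request` and `predicted_latency_ms` from the sketch above. The policy details here (strict high-priority admission, a simple defer queue) are assumptions, not Strait's published algorithm.

```python
def admit(req: Request, priority: str, running_high: list[Request],
          running_models: list[str], defer_queue: list[Request]) -> bool:
    """Admit a request only if it won't push any running high-priority
    job past its deadline budget; otherwise defer the best-effort work."""
    if priority == "high":
        return True  # high-priority is always admitted; low-priority yields
    for hp in running_high:
        # Re-estimate the high-priority job's latency if req joins the GPU.
        with_req = predicted_latency_ms(hp, running_models + [req.model])
        if with_req > hp.deadline_ms:
            defer_queue.append(req)  # retry once the GPU drains
            return False
    return True
```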
GPU cluster utilization and SLA compliance have long been in tension because most schedulers optimize for throughput, not deadline semantics. Strait instead treats GPU time as a real-time resource. On-premises inference environments, where a fixed pool of GPUs is shared across multiple models, stand to benefit most: adding capacity there means procurement cycles, not elastic cloud scaling.
Compared to software-defined preemption, Strait delivers more equitable performance across workloads. Because it avoids hard interruptions, it sidesteps the variable overhead that mixed workloads incur: transformer inference running alongside CNN-based vision pipelines often sees highly variable preemption cost, degrading whichever workload happens to be mid-kernel when preemption fires.
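Why that overhead varies so much is easy to see with a toy model: kernel-boundary preemption must wait for the in-flight kernel to finish, so the delay is that kernel's residual time. The kernel durations below are invented for illustration; a few long attention-style kernels mixed with many short convolution kernels yields a heavy-tailed delay distribution.

```python
import random
import statistics

# Made-up kernel durations (ms): long attention kernels mixed with
# short convolution kernels, as in the transformer + CNN scenario above.
KERNELS_MS = [8.0, 7.5, 6.0] * 3 + [0.4, 0.6, 0.5] * 20

def preemption_delay(kernels: list[float]) -> float:
    """Delay until the next kernel boundary when preemption fires at a
    uniformly random instant during the running kernel sequence."""
    t = random.uniform(0, sum(kernels))
    for k in kernels:
        if t < k:
            return k - t  # residual time of the in-flight kernel
        t -= k
    return 0.0

delays = [preemption_delay(KERNELS_MS) for _ in range(10_000)]
print(f"mean {statistics.mean(delays):.2f} ms, "
      f"stdev {statistics.stdev(delays):.2f} ms, "
      f"max {max(delays):.2f} ms")
```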
The evaluation covers dual-priority traffic only. Multi-tier SLA environments, common at large enterprises with three or more service classes, remain untested. The adaptive prediction model's accuracy across diverse architectures (attention-heavy LLMs, convolution-heavy vision models, sparse MoE variants) is also not fully characterized. Strait is a research prototype: there is no production deployment data and no described integration path with existing serving frameworks such as Triton Inference Server or vLLM.
The deadline-awareness gap in GPU schedulers is well documented in production MLOps. Strait's interference modeling gives infrastructure teams a concrete algorithmic target to evaluate against their own SLA violation rates.
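Measuring that baseline is straightforward. A minimal sketch, assuming a request log in CSV form with `priority` and `latency_ms` columns (hypothetical field names):

```python
import csv

def violation_rate(log_path: str, deadline_ms: float) -> float:
    """Fraction of high-priority requests whose measured latency
    exceeded the deadline budget."""
    total = violations = 0
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["priority"] != "high":
                continue
            total += 1
            if float(row["latency_ms"]) > deadline_ms:
                violations += 1
    return violations / total if total else 0.0
```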