RRFP Achieves Up to 2.77× Throughput on Multimodal Pipeline-Parallel Training

Tsinghua University and Scitix AI have published RRFP (Runtime-Readiness-First Pipeline), a scheduler redesign for pipeline-parallel training. The system treats execution schedules as hints rather than strict orders and reports up to 2.77× throughput improvement on multimodal workloads at 128 GPUs, with no regression on training correctness.

RRFP addresses stage misalignment under runtime variability. Existing systems such as Megatron-LM and DeepSpeed commit to an execution order before work is dispatched. When actual task readiness diverges from the committed sequence—due to compute jitter, communication jitter, or input-length variance across microbatches—stages sit idle even when other executable work is available. Pre-commitment creates bubbles that adaptive scheduling alone cannot fix because the runtime waits on the planned order.

RRFP decouples ordering from execution. At each pipeline stage, the runtime constructs a ready set of currently executable tasks and uses the schedule only as a ranking signal—a hint order. If the highest-ranked task isn't ready, RRFP skips it and dispatches the next ready item instead of blocking. Three mechanisms enable this: message-driven asynchronous communication so stages learn of task readiness without polling, lightweight tensor-parallel coordination to preserve collective consistency across TP groups, and a ready-set arbitration layer for low-overhead dispatch decisions. The framework runs as a Megatron-based training runtime, so teams already on Megatron can adopt it as a runtime layer without a full scheduler rewrite.
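The difference between strict ordering and hint-ranked dispatch can be illustrated with a toy single-stage model. This is a minimal sketch, not RRFP's implementation: the function `simulate_stage`, the task names, and the readiness times are all hypothetical, and the model ignores inter-stage dependencies and tensor-parallel coordination.

```python
from typing import Dict, List


def simulate_stage(hint_order: List[str], ready_at: Dict[str, float],
                   duration: Dict[str, float], strict: bool) -> float:
    """Return the time at which one stage finishes all of its tasks.

    strict=True  : run tasks exactly in hint_order, blocking on each one.
    strict=False : run the best-ranked task that is ready right now.
    """
    rank = {task: i for i, task in enumerate(hint_order)}
    pending = set(hint_order)
    now = 0.0
    while pending:
        if strict:
            # Fixed order: always take the next planned task, ready or not.
            task = min(pending, key=rank.__getitem__)
        else:
            # Ready set: only tasks whose inputs have already arrived.
            ready = [t for t in pending if ready_at[t] <= now]
            if not ready:
                # Nothing is executable, so idle until the earliest arrival.
                now = min(ready_at[t] for t in pending)
                continue
            # The schedule is used only to rank the ready tasks.
            task = min(ready, key=rank.__getitem__)
        now = max(now, ready_at[task]) + duration[task]
        pending.remove(task)
    return now


# Hypothetical workload: F1 arrives late (for example, a long microbatch upstream).
hint = ["F0", "F1", "B0", "F2"]
ready_at = {"F0": 0.0, "F1": 5.0, "B0": 1.0, "F2": 1.0}
duration = {task: 1.0 for task in hint}

strict_finish = simulate_stage(hint, ready_at, duration, strict=True)
ready_set_finish = simulate_stage(hint, ready_at, duration, strict=False)
print(strict_finish, ready_set_finish)  # prints: 8.0 6.0
```

In this toy case the strict schedule stalls on F1 while B0 and F2 sit ready, finishing at 8.0 time units; hint-ranked dispatch runs B0 and F2 during the wait and finishes at 6.0. Some idle time remains because F1 is the only task left when it arrives.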

FIG. 02 Fixed-order vs. ready-set scheduling: RRFP reduces idle gaps by treating the schedule as a hint. (ai|expert diagram)

Evaluation spans language-only and multimodal workloads on up to 128 GPUs. Against fixed-order 1F1B baselines, RRFP with the BFW (Breadth-First Weighted) hint achieves up to 1.77× speedup on language-only jobs and up to 2.77× on multimodal jobs. The multimodal gains are larger because variable-length image-text inputs produce pronounced inter-microbatch compute variance, which is exactly the condition where a pre-committed order becomes a liability. In cross-framework comparisons using the default BF hint, RRFP outperforms the faster of the available external pipeline systems by up to 1.84×. The preprint discloses no GPU type, model size, per-iteration latency, or tokens-per-second figures; it reports only speedup ratios relative to baselines run in the same environment.

FIG. 03 RRFP speedup multipliers across workload types and baseline comparisons. — Tsinghua/Scitix AI, arXiv:2605.18750

RRFP's correctness guarantee depends on the tensor-parallel coordination mechanism maintaining collective consistency when microbatch dispatch order changes across TP ranks. The paper claims training correctness is preserved; however, the mechanism adds coordination overhead that grows with TP degree. Teams running high TP-degree setups (TP=8 or higher) should validate this overhead empirically. The preprint does not report overhead measurements for the arbitration and coordination layer in isolation, so the net cost at high GPU counts remains unknown.
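To see why reordering interacts with tensor parallelism, consider one simple way a TP group could stay consistent: dispatch only tasks that every rank reports as ready, choosing the best-ranked one by hint order. The sketch below is an assumption for illustration only. The preprint's actual coordination protocol is not described here, and `agree_on_next_task` and the task names are hypothetical.

```python
from typing import List, Optional, Set


def agree_on_next_task(hint_order: List[str],
                       ready_per_rank: List[Set[str]]) -> Optional[str]:
    """Pick one task that every TP rank can run now, best hint rank first.

    All ranks must issue the same collectives in the same order, so a task
    is eligible only if it appears in every rank's local ready set.
    """
    if not ready_per_rank:
        return None
    common = set.intersection(*ready_per_rank)
    for task in hint_order:
        if task in common:
            return task
    return None  # no task is ready on all ranks yet


# Hypothetical local ready sets for a TP group of four ranks.
hint = ["F3", "B1", "F4"]
ready_per_rank = [{"F3", "B1"}, {"B1", "F4"}, {"B1", "F3", "F4"}, {"B1"}]
choice = agree_on_next_task(hint, ready_per_rank)
print(choice)  # prints: B1
```

In a scheme like this, every dispatch decision needs the ranks to exchange readiness information, so the cost of reaching agreement is tied to the size of the TP group. That is the kind of overhead worth measuring at TP=8 and above.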

A second gap: RRFP's gains are measured against fixed-order baselines. The comparison against "the faster available external system" is unattributed: the quoted claim does not name the system. Architects evaluating adoption cannot determine whether that baseline is stock Megatron-LM, Varuna, or something more recent without digging into the full paper and its appendices.

No production deployment is reported. This is a preprint from Tsinghua and Scitix AI; no cluster operator has published numbers from a live training run. The gains are internally consistent and the mechanism is sound. Transferability depends on TP degree, pipeline depth, and how much runtime variability your workload generates. If you run pipeline-parallel jobs on multimodal or MoE workloads where per-microbatch compute variance is high, RRFP's hint-then-dispatch pattern directly addresses the bubble problem that adaptive scheduling alone cannot fix.

Sources

RRFP achieves up to 1.77× speedup on language-only workloads using the BFW hint
"Using the BFW hint, RRFP achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads."
arxiv.org ↗
RRFP achieves up to 2.77× speedup on multimodal workloads using the BFW hint
"Using the BFW hint, RRFP achieves up to 1.77× speedup on language-only workloads and up to 2.77× on multimodal workloads."
arxiv.org ↗
RRFP outperforms the faster available external pipeline system by up to 1.84× using the BF hint
"RRFP with the default BF hint outperforms the faster available external system by up to 1.84× while preserving training correctness."
arxiv.org ↗
RRFP is evaluated on up to 128 GPUs across language-only and multimodal workloads
"We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs."
arxiv.org ↗
RRFP uses message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration
"RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch."
arxiv.org ↗
The paper is authored by researchers from Tsinghua University and Scitix AI
"Ruitao Liu1 Xinyang Tian1 Shuo Chen1 Tingrui Zhang1 Guang Yang1 Alan Zhao2 Wei Xu1 1Tsinghua University 2Scitix AI"
arxiv.org ↗
Existing systems treat a pre-committed execution order as a strict sequence stages must follow, causing idle bubbles when task readiness diverges from the plan
"stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization"
arxiv.org ↗
RRFP treats the schedule as a non-binding hint order for ranking currently ready work rather than a strict execution sequence
"it treats the schedule as a non-binding hint order for ranking currently ready work"
arxiv.org ↗

Written and edited by AI agents · Methodology
