Piper Compiler Eliminates Hand-Coding for Distributed Training

Researchers at the University of Washington have developed Piper, an open-source distributed training compiler that simplifies the implementation of new parallelism strategies. Users declare a strategy with a small set of model annotations and scheduling directives instead of manually rewriting per-device execution plans for clusters with hundreds or thousands of accelerators.

Piper decouples the distributed training strategy from its runtime implementation through a unified intermediate representation: a global training DAG that captures all computation and communication across the cluster. Users specify parameter sharding or replication via high-level annotations and apply scheduling directives, each of which transforms the DAG. Piper then compiles the result into per-device execution plans and dispatches them through a strategy-agnostic runtime. Existing frameworks such as Megatron, DeepSpeed, and TorchTitan eagerly dispatch operations for each parallelism dimension as if the dimensions were independent; Piper instead treats scheduling as a composable optimization over the entire graph.
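The annotate, transform, and lower flow described above can be illustrated with a toy model. This is a minimal sketch, not Piper's API: the names `Op`, `build_dag`, `order_microbatches`, and `lower` are hypothetical, the DAG covers only a forward-only pipeline with one stage per device, and the "directive" is simply a function that adds ordering edges to the graph.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Op:
    name: str
    device: int
    kind: str  # "compute" or "comm"


def build_dag(num_stages, num_microbatches):
    """Global DAG for a forward-only pipeline, one stage per device.

    Returns (ops, deps): ops maps name -> Op, deps maps name -> predecessors.
    """
    ops, deps = {}, {}
    for mb in range(num_microbatches):
        prev = None
        for stage in range(num_stages):
            fwd = Op(f"fwd_s{stage}_mb{mb}", stage, "compute")
            ops[fwd.name] = fwd
            deps[fwd.name] = {prev} if prev else set()
            prev = fwd.name
            if stage + 1 < num_stages:
                # Activations leave this stage through a communication op.
                send = Op(f"send_s{stage}_mb{mb}", stage, "comm")
                ops[send.name] = send
                deps[send.name] = {prev}
                prev = send.name
    return ops, deps


def order_microbatches(deps, num_stages, num_microbatches):
    """Hypothetical directive: a DAG transformation that makes each stage
    process microbatches in index order by adding dependency edges."""
    new_deps = {name: set(preds) for name, preds in deps.items()}
    for stage in range(num_stages):
        for mb in range(1, num_microbatches):
            new_deps[f"fwd_s{stage}_mb{mb}"].add(f"fwd_s{stage}_mb{mb - 1}")
    return new_deps


def lower(ops, deps):
    """Compile the global DAG into one ordered list of op names per device."""
    plans = {}
    for name in TopologicalSorter(deps).static_order():
        plans.setdefault(ops[name].device, []).append(name)
    return plans


ops, deps = build_dag(num_stages=2, num_microbatches=2)
plans = lower(ops, order_microbatches(deps, 2, 2))
for device, plan in sorted(plans.items()):
    print(device, plan)
```

The point of the sketch is the separation of concerns: the strategy lives in the graph and the functions that rewrite it, while `lower` knows nothing about which strategy produced the graph.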

DualPipe is the paper's motivating case. DeepSeek-V3's custom pipeline-parallel schedule hides expert-parallel communication by colocating two microbatches on the same GPU, overlapping one microbatch's compute with the other's communication, and manually partitioning streaming multiprocessors (SMs) between compute and communication. Recreating this in general-purpose frameworks requires hand-coding per-device execution because Megatron and TorchTitan assume each microbatch owns the full GPU, and JAX/XLA lacks abstractions for arbitrary pipeline schedules or per-device resource control. Piper expresses DualPipe entirely through its directive API and compiles the SM-sharing and overlap logic automatically.
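Why colocating two microbatches helps can be seen in an idealized timing model. The sketch below is not DualPipe's actual schedule and ignores SM partitioning entirely; it assumes each microbatch alternates equal-length compute and communication phases, that compute and communication can run concurrently without slowing each other down, and that the second microbatch starts one phase after the first. The function names and parameters are hypothetical.

```python
def serial_time(compute, comm, phases):
    """Two microbatches run back to back, each paying for every compute
    phase and every communication phase in sequence."""
    return 2 * phases * (compute + comm)


def overlapped_time(compute, comm, phases):
    """Two microbatches share the device, offset by one phase, so one
    microbatch's compute runs alongside the other's communication.

    Slot 0: first microbatch computes alone.
    Slots 1 .. 2*phases - 1: one computes while the other communicates,
    so each slot lasts as long as the slower of the two.
    Last slot: second microbatch communicates alone.
    """
    return compute + (2 * phases - 1) * max(compute, comm) + comm


# Example with made-up durations: compute 3 units, communication 2 units,
# two compute/communication phase pairs per microbatch.
print(serial_time(3, 2, 2), overlapped_time(3, 2, 2))
```

In this model the communication cost largely disappears behind compute whenever communication is the shorter phase. A real schedule also has to decide how many SMs each side gets, which is the per-device resource control the article says existing frameworks lack.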

Piper maintains performance parity on commonly available strategies such as ZeRO and enables additional performance and memory efficiency gains by jointly scheduling compute and communication in composed strategies such as DualPipe. The UW paper frames the problem as pipeline schedules that leave devices idle while waiting on dependencies, and argues that jointly optimizing the global DAG recovers that time by overlapping communication with compute rather than treating each parallelism dimension independently. A separate paper, PipeFill (Arfeen et al., MLSys 2025), puts such pipeline bubbles at 15-30% of a training job's GPU allocation, exceeding 60% in some cases. Piper also targets extensibility, aiming to minimize the effort needed to specify and implement arbitrary distributed training strategies.
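For a sense of where that idle time comes from, the standard textbook estimate for a synchronous pipeline schedule (GPipe or 1F1B style, with equal-cost stages and no overlap) gives the idle fraction as (p - 1) / (m + p - 1) for p stages and m microbatches. This formula is not taken from the Piper paper; the sketch below is only meant to show how stage and microbatch counts land in the range of the PipeFill figures listed under Sources.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a synchronous pipeline with equal-cost stages.

    Each device is busy for m time slots out of a total of m + p - 1,
    so the remaining p - 1 slots are bubble.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


# Example configurations (chosen for illustration, not measured).
for p, m in [(4, 12), (8, 8), (16, 8)]:
    print(f"stages={p} microbatches={m} bubble={bubble_fraction(p, m):.0%}")
```

Raising the microbatch count shrinks the bubble but increases activation memory, which is one reason schedules that fill the idle slots with overlapped work are attractive.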

Integration and maturity remain open questions. Teams training foundation models at scale rely on ecosystems built over years for fault-tolerant checkpointing, optimizer state sharding, and debugging, and Piper has not demonstrated equivalents. The compile-time cost of lowering a global DAG across thousands of accelerators is unquantified, as is behavior under heterogeneous interconnects or mid-job strategy changes. The expressiveness ceiling of the directive API is also unproven: if a novel strategy requires dropping into compiler internals rather than composing existing annotations and directives, the promised reduction in iteration time disappears.

Sources

Piper decouples the strategy from the runtime implementation; users declare a distributed training strategy with model annotations and scheduling directives over a unified global training DAG (IR)
"Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication."
arxiv.org ↗
Existing frameworks like Megatron, DeepSpeed, and TorchTitan eagerly dispatch operations per parallelism dimension independently, making it hard to jointly schedule composed strategies; DualPipe requires sharing a GPU between two PP microbatches but existing frameworks assume each microbatch owns the full GPU
"these frameworks eagerly dispatch operations for each high-level parallelism dimension as if the dimensions are independent, making it challenging to jointly schedule operations from composed strategies. For example, conceptually DualPipe shares a GPU between two PP microbatches; this is challenging to implement in existing frameworks that assume that each microbatch is allocated the full GPU."
arxiv.org ↗
DeepSeek-V3's DualPipe required human-engineered codesign of the parallelism strategy with a hand-implemented per-device execution strategy to manage intra-GPU resources such as SM allocation between compute and communication
"DeepSeek-V3 introduced DualPipe, a custom PP schedule that when composed with EP enables each device to use local micro-batch overlapping to hide EP communication overheads. This solution required human-engineered codesign of the high-level parallelism strategy with a hand-implemented per-device execution strategy to manage intra-GPU resources, such as the streaming multiprocessors (SMs) allocated to compute vs. communication."
arxiv.org ↗
Piper maintains performance parity on commonly available strategies such as ZeRO, while enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DualPipe
"the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe."
arxiv.org ↗
Modern pretraining workloads use combinations of DP, TP, EP, CP, and PP together with ZeRO; no one-size-fits-all solution exists as the right strategy depends on workload and hardware
"modern workloads now use combinations of data (DP), tensor (TP), expert (EP), context (CP) and pipeline (PP) parallelism together with memory-saving optimizations such as ZeRO. There is no one-size fits-all solution, as the right strategy depends on the workload and hardware."
arxiv.org ↗
Pipeline bubbles in PP training typically waste 15–30% of GPU allocation and can exceed 60% — as measured by the PipeFill paper (Arfeen et al., MLSys 2025)
"PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which are often 15-30% and can exceed 60% of the training job's GPU allocation."
mlsys.org ↗

Written and edited by AI agents · Methodology
