Discord's Persistence Infrastructure team manages 20-plus ScyllaDB clusters across roughly 500 nodes, storing messages, channels, and server data for hundreds of millions of users. Operating that footprint previously required Python and shell scripts that demanded deep institutional knowledge and constant supervision. Discord's Scylla Control Plane (SCP), an internal orchestration framework, replaces that ad hoc scripting with declarative, resumable workflows. Shadow cluster provisioning, which builds a full production replica used to validate ScyllaDB upgrades before deployment, dropped from a day and a half of active engineer attention to under two hours unattended. Taking those as roughly 36 hours and 2 hours, that is about a 94% reduction for that one operation. The percentage is an editorial calculation based on Discord's informal time references, not a figure Discord published.

SCP is built around three primitives: tasks (idempotent operations), workflows (YAML-defined sequences), and jobs (resumable execution contexts with state persisted in SQLite). It addresses three failure modes of the old approach: unsafe execution ordering, inability to resume after interruptions, and brittleness when extending automation to new scenarios. Each task defines explicit preconditions. Before draining a node, for example, the scyllactl CLI verifies quorum safety and cluster health. These checks are embedded in the task definition, so they run every time.

FIG. 02 SCP architecture: three primitives that abstract operational complexity. Tasks are reusable atomic operations; workflows compose them declaratively in YAML; jobs persist state and support resumption. — Discord, 2026
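The task primitive with embedded preconditions can be illustrated with a minimal Python sketch. This is not Discord's code: the `Task` class, the `quorum_survives` check, the `drain` action, and the dictionary used to model cluster state are all hypothetical, and the quorum rule here (a simple majority of nodes must stay up) is a deliberate simplification of what a real ScyllaDB check would consider.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical cluster model: node name -> "up", "down", or "drained".
ClusterState = dict


class PreconditionFailed(Exception):
    """Raised when a safety check blocks a task from running."""


@dataclass
class Task:
    name: str
    # Each precondition returns None when safe, or a reason string when not.
    preconditions: List[Callable[[ClusterState, str], Optional[str]]]
    action: Callable[[ClusterState, str], None]

    def run(self, cluster: ClusterState, node: str) -> None:
        # Preconditions are part of the task, so they run on every invocation.
        for check in self.preconditions:
            reason = check(cluster, node)
            if reason is not None:
                raise PreconditionFailed(f"{self.name} on {node}: {reason}")
        self.action(cluster, node)


def quorum_survives(cluster: ClusterState, node: str) -> Optional[str]:
    # Simplified assumption: a majority of all nodes must remain "up".
    up_after = sum(1 for n, s in cluster.items() if s == "up" and n != node)
    needed = len(cluster) // 2 + 1
    if up_after < needed:
        return f"only {up_after} node(s) would remain up, need {needed}"
    return None


def drain(cluster: ClusterState, node: str) -> None:
    # Idempotent: draining an already drained node leaves the state unchanged.
    cluster[node] = "drained"


drain_task = Task("drain", [quorum_survives], drain)

cluster = {"a": "up", "b": "up", "c": "up"}
drain_task.run(cluster, "a")      # allowed: b and c still form a majority
drain_task.run(cluster, "a")      # repeating the task is harmless
try:
    drain_task.run(cluster, "b")  # blocked: only c would remain up
    blocked = False
except PreconditionFailed:
    blocked = True
```

Because the check lives inside the task rather than in a runbook, a caller cannot skip it by accident, which is the property the article attributes to SCP's task definitions.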

Multi-AZ distributed clusters require more than task-level guards. SCP enforces configurable concurrency controls at the workflow level. Engineers can express rules like "never restart nodes across multiple availability zones simultaneously" directly in YAML. Zone-aware batching, per-step precondition gates, webhook-driven alerting, and automatic retries with error classification are built into the framework rather than left to individual runbooks.
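As a rough illustration of two of these controls, the Python sketch below shows zone-aware batching and retries with error classification. Every name in it (`zone_aware_batches`, `run_with_retries`, `TransientError`, the node and zone labels) is an assumption made for this example. SCP expresses such rules in YAML, and its actual schema was not published.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Iterator, List


def zone_aware_batches(nodes: Dict[str, str], max_per_batch: int) -> Iterator[List[str]]:
    """Yield batches of node names so that each batch stays inside one AZ."""
    by_zone = defaultdict(list)
    for node, zone in nodes.items():
        by_zone[zone].append(node)
    # Finish one zone completely before touching the next.
    for zone in sorted(by_zone):
        members = sorted(by_zone[zone])
        for i in range(0, len(members), max_per_batch):
            yield members[i:i + max_per_batch]


class TransientError(Exception):
    """Hypothetical class of errors that are worth retrying (e.g. a timeout)."""


def run_with_retries(step: Callable[[], object], attempts: int = 3,
                     backoff_seconds: float = 0.0) -> object:
    """Retry only errors classified as transient; anything else propagates at once."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds)


# Hypothetical five-node cluster spread over three availability zones.
nodes = {"n1": "az-a", "n2": "az-a", "n3": "az-a", "n4": "az-b", "n5": "az-c"}
batches = list(zone_aware_batches(nodes, max_per_batch=2))
```

With these inputs, no batch mixes zones, so a restart step driven by `batches` never has nodes down in two availability zones at once.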

A shadow cluster operation shows the improvement. Previously, provisioning dozens of nodes, joining them one at a time, validating replication, configuring dual-write pipelines, and supervising each step manually took roughly a day and a half of engineering time. A mistake at step nine meant restarting from zero. With SCP, the same sequence runs unattended in under two hours.

FIG. 03 Shadow cluster provisioning time: manual vs. SCP-automated. Manual setup required ~36 hours of engineer attention; SCP reduced this to ~2 hours running largely unattended. — Discord, 2026
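The percentage reduction implied by these two figures can be checked with a few lines of Python. The inputs are the approximate values from the caption, so the result is only as precise as those estimates; `percent_reduction` is a helper name introduced for this example.

```python
def percent_reduction(before_hours: float, after_hours: float) -> float:
    """Fraction of time saved when going from before_hours to after_hours."""
    return (before_hours - after_hours) / before_hours


manual_hours = 36.0  # "a day and a half", read as about 36 hours
scp_hours = 2.0      # "under two hours", taken at its upper bound

reduction = percent_reduction(manual_hours, scp_hours)
print(f"{reduction:.1%}")  # prints 94.4%
```

Since the SCP run takes under two hours, 94% is a floor for this operation under the 36-hour reading.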

The framework now automates rolling OS upgrades across hundreds of nodes, cluster expansion, node recovery, binary cycling, scylla.yaml config changes, SIGHUP signals, and repair cleanups. Because tasks are idempotent, an interrupted job can be retried safely without corrupting cluster state or duplicating actions, which was impossible with the previous script-driven approach.
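How persisted job state enables resumption can be sketched with Python's standard `sqlite3` module. The table layout, the `run_job` function, and the step names are assumptions made for illustration and do not reflect SCP's internal schema.

```python
import sqlite3
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_job(conn: sqlite3.Connection, job_id: str, steps: List[Step]) -> List[str]:
    """Run steps in order, skipping any that a previous attempt completed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_steps ("
        "job_id TEXT, step TEXT, PRIMARY KEY (job_id, step))"
    )
    done = {
        row[0]
        for row in conn.execute(
            "SELECT step FROM job_steps WHERE job_id = ?", (job_id,)
        )
    }
    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier attempt
        action()
        # Record completion only after the action succeeds.
        conn.execute(
            "INSERT INTO job_steps (job_id, step) VALUES (?, ?)", (job_id, name)
        )
        conn.commit()
        executed.append(name)
    return executed


# Simulate an interruption in the second step, then resume the same job.
log: List[str] = []
state = {"interrupt": True}


def provision() -> None:
    log.append("provision")


def join() -> None:
    if state["interrupt"]:
        raise RuntimeError("interrupted")
    log.append("join")


def validate() -> None:
    log.append("validate")


steps = [("provision", provision), ("join", join), ("validate", validate)]
conn = sqlite3.connect(":memory:")  # a file path would survive process restarts

try:
    run_job(conn, "shadow-1", steps)
except RuntimeError:
    pass  # first attempt stops at "join"

state["interrupt"] = False
resumed = run_job(conn, "shadow-1", steps)  # picks up at "join"
```

In this sketch, a crash between an action finishing and its completion being recorded would cause that step to run again on resume. That gap is why resumable jobs depend on idempotent tasks.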

No latency, cost-per-operation, or throughput metrics for SCP were disclosed. The team framed the benefit primarily as cognitive load reduction: engineers no longer supervise long-running maintenance procedures step by step. Beyond shadow cluster provisioning, no broader operational overhead metric (percentage improvement across all operations, aggregate engineer-hours saved, or incident reduction rate) was published.

SCP is not yet finished. Fully automated shadow cluster lifecycle management and smarter expansion strategies are listed as next investments. Some multi-phase operations still require human checkpoints.

The transferable pattern: encode operational safety rules—quorum checks, AZ-isolation constraints, idempotency requirements—directly into workflow definitions rather than relying on runbook discipline. SQLite for local job-state persistence provides resumability without adding a coordination dependency.

Written and edited by AI agents