Discord cuts shadow cluster provisioning time by 94% with SCP

Discord's Persistence Infrastructure team manages more than 20 ScyllaDB clusters across almost 500 nodes, storing messages, channels, and server data for hundreds of millions of users. Operating that footprint previously depended on Python and shell scripts that demanded deep institutional knowledge and constant supervision. Discord's Scylla Control Plane (SCP), an internal orchestration framework, replaces that ad hoc scripting with declarative, resumable workflows. Provisioning a shadow cluster, a full production replica used to validate ScyllaDB upgrades before deployment, dropped from a day and a half of active engineer attention to under two hours running largely unattended. Reading those figures as roughly 36 hours and 2 hours gives a reduction of about 94% for that operation. This is an editorial calculation based on Discord's informal time references, not a figure Discord published.
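
The 94% figure can be reproduced with a short calculation. The sketch below assumes that "a day and a half" means 36 hours and that "under two hours" is taken at its upper bound of 2 hours; both values are editorial readings, not numbers Discord published.

```python
# Assumed readings of Discord's informal time references (not published figures).
manual_hours = 36.0  # "a day and a half" of engineer attention
scp_hours = 2.0      # "under two hours", taken at its upper bound


def percent_reduction(before: float, after: float) -> float:
    """Return the percentage decrease from `before` to `after`."""
    return (before - after) / before * 100.0


reduction = percent_reduction(manual_hours, scp_hours)
print(f"{reduction:.1f}% reduction")  # prints "94.4% reduction"
```

Because the SCP run takes less than two hours, 94% is a lower bound under these assumptions. The comparison also mixes attended time (manual) with largely unattended time (SCP), so the saving in engineer attention is likely larger than the wall-clock ratio suggests.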

SCP is built around three primitives: tasks (idempotent operations), workflows (YAML-defined sequences of tasks), and jobs (resumable execution contexts with state persisted in SQLite). It addresses three failure modes of the old approach: unsafe execution ordering, inability to resume after interruptions, and brittleness when extending automation to new scenarios. Each task defines explicit preconditions. Before draining a node, the scyllactl CLI verifies that the node is quorum-safe and that the cluster is healthy. These checks are part of the task definition and run every time, regardless of who invokes the operation.
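
The idea of a task that carries its own preconditions can be sketched in a few lines. This is an illustration only, not SCP's implementation: the `Task` class, the `ClusterState` mapping, and the simplified health and quorum rules are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical cluster view: node name -> True if the node is up.
ClusterState = Dict[str, bool]


class PreconditionFailed(Exception):
    """Raised when a task's guard rejects the current cluster state."""


@dataclass
class Task:
    name: str
    action: Callable[[ClusterState, str], None]
    preconditions: List[Callable[[ClusterState, str], bool]] = field(default_factory=list)

    def run(self, cluster: ClusterState, node: str) -> None:
        # Guards run on every invocation; there is no flag to skip them.
        for check in self.preconditions:
            if not check(cluster, node):
                raise PreconditionFailed(f"{self.name}: {check.__name__} failed for {node}")
        self.action(cluster, node)


def cluster_healthy(cluster: ClusterState, node: str) -> bool:
    # Simplified health rule (assumption): every node is currently up.
    return all(cluster.values())


def quorum_safe(cluster: ClusterState, node: str) -> bool:
    # Simplified quorum rule (assumption): a strict majority stays up without `node`.
    remaining_up = sum(up for name, up in cluster.items() if name != node)
    return remaining_up > len(cluster) // 2


def drain(cluster: ClusterState, node: str) -> None:
    # Stand-in for a real drain; here it only marks the node as down.
    cluster[node] = False


drain_task = Task("drain", drain, [cluster_healthy, quorum_safe])
```

Calling `drain_task.run(cluster, "a")` on a healthy three-node cluster succeeds. A second call for another node then raises `PreconditionFailed`, because the cluster is no longer fully healthy. The point of the design is that the guard lives with the task, so a caller cannot forget it.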

FIG. 02 SCP architecture: three primitives that abstract operational complexity. Tasks are reusable atomic operations; workflows compose them declaratively in YAML; jobs persist state and support resumption. — Discord, 2026

Multi-AZ distributed clusters require more than task-level guards. SCP enforces configurable concurrency controls at the workflow level. Engineers can express rules like "never restart nodes across multiple availability zones simultaneously" directly in YAML. Zone-aware batching, per-step precondition gates, webhook-driven alerting, and automatic retries with error classification are built into the framework rather than left to individual runbooks.
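
A rule such as "never restart nodes across multiple availability zones simultaneously" comes down to batching nodes so that no batch spans zones. The function below is a minimal sketch of that idea under stated assumptions; `zone_aware_batches` and the zone labels are hypothetical and do not reflect SCP's YAML schema or internals.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


def zone_aware_batches(node_zones: Dict[str, str], batch_size: int) -> Iterator[List[str]]:
    """Yield batches of node names such that no batch spans more than one zone."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Group nodes by their availability zone.
    by_zone: Dict[str, List[str]] = defaultdict(list)
    for node, zone in node_zones.items():
        by_zone[zone].append(node)

    # Finish one zone completely before moving to the next.
    for zone in sorted(by_zone):
        nodes = sorted(by_zone[zone])
        for start in range(0, len(nodes), batch_size):
            yield nodes[start:start + batch_size]


# Hypothetical five-node cluster spread over two zones.
nodes = {"n1": "az-a", "n2": "az-b", "n3": "az-a", "n4": "az-a", "n5": "az-b"}
for batch in zone_aware_batches(nodes, batch_size=2):
    print(batch)
```

The guarantee holds only if the caller processes batches one at a time and waits for each to finish before starting the next. A real orchestrator would also re-run its precondition checks between batches.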

A shadow cluster operation shows the improvement. Previously, provisioning dozens of nodes, joining them one at a time, validating replication, configuring dual-write pipelines, and supervising each step manually took roughly a day and a half of engineering time. Mistakes at step nine meant restarting from zero. With SCP, the same sequence runs largely unattended in under two hours.

FIG. 03 Shadow cluster provisioning time: manual vs. SCP-automated. Manual setup required ~36 hours of engineer attention; SCP reduced this to ~2 hours running largely unattended. — Discord, 2026

The framework now automates rolling OS upgrades across hundreds of nodes, cluster expansion, node recovery, binary cycling, scylla.yaml config changes, SIGHUP signals, and repair cleanups. Because tasks are idempotent and job state is persisted, an interrupted job can be retried safely without corrupting cluster state or duplicating actions, something the previous script-driven approach could not guarantee.
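
Resumability of this kind can be sketched with Python's standard `sqlite3` module: record each completed step and skip recorded steps when the job runs again. The table name, the `run_job` function, and the step names are hypothetical; this shows the general pattern, not SCP's actual schema.

```python
import sqlite3
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the job-state database. Use a file path for real persistence."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS completed_steps ("
        "job_id TEXT NOT NULL, step TEXT NOT NULL, PRIMARY KEY (job_id, step))"
    )
    conn.commit()
    return conn


def run_job(conn: sqlite3.Connection, job_id: str, steps: List[Step]) -> List[str]:
    """Run steps in order, skipping recorded ones; return the steps executed this time."""
    executed: List[str] = []
    for name, action in steps:
        done = conn.execute(
            "SELECT 1 FROM completed_steps WHERE job_id = ? AND step = ?",
            (job_id, name),
        ).fetchone()
        if done:
            continue  # finished in an earlier run
        action()  # if this raises, nothing is recorded and the step reruns on resume
        conn.execute(
            "INSERT INTO completed_steps (job_id, step) VALUES (?, ?)", (job_id, name)
        )
        conn.commit()
        executed.append(name)
    return executed
```

If the process dies after `action()` completes but before the commit, that step runs again on resume. This is why idempotent steps matter: the persisted state gives at-least-once execution, and idempotency makes the repeat harmless.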

No latency, cost-per-operation, or throughput metrics for SCP were disclosed. The team framed the benefit primarily as cognitive load reduction: engineers no longer supervise long-running maintenance procedures step by step. Beyond shadow cluster provisioning, no broader operational overhead metric (percentage improvement across all operations, aggregate engineer-hours saved, or incident reduction rate) was published.

SCP is not yet finished. Fully automated shadow cluster lifecycle management and smarter expansion strategies are listed as next investments. Some multi-phase operations still require human checkpoints.

The transferable pattern: encode operational safety rules—quorum checks, AZ-isolation constraints, idempotency requirements—directly into workflow definitions rather than relying on runbook discipline. SQLite for local job-state persistence provides resumability without adding a coordination dependency.

Sources

Discord manages over 20 ScyllaDB clusters consisting of almost 500 nodes
"At Discord, our small team operates over 20 ScyllaDB clusters consisting of almost 500 nodes."
scylladb.com ↗
Old tooling consisted of fragile Python and shell scripts requiring deep institutional knowledge
"Historically, these operations relied on fragile Python and shell scripts that required deep institutional knowledge and constant manual supervision."
infoq.com ↗
SCP is built around reusable tasks, workflows, and resumable jobs with state persisted in SQLite
"SCP introduces explicit preconditions, state persistence through SQLite, error classification, webhook-driven alerting, and configurable parallelism"
infoq.com ↗
scyllactl automatically verifies quorum safety and cluster health before draining a node, as part of the task definition
"Before the drain runs, SCP automatically checks that the node is quorum-safe (i.e. there are enough nodes available to serve accurate requests) and that the cluster is healthy. These checks aren't optional — they're part of the task definition and run every time, regardless of who invokes the operation."
discord.com ↗
SCP enforces concurrency controls such as 'never restart nodes across multiple availability zones simultaneously'
"SCP uses configurable concurrency controls that allow engineers to define rules such as 'never restart nodes across multiple availability zones simultaneously,' protecting cluster quorum and availability during maintenance operations."
infoq.com ↗
Shadow cluster provisioning dropped from a day and a half of engineer attention to under two hours running largely unattended
"You're looking at the next day and a half... what if this whole ordeal took less than two hours?"
discord.com ↗
Shadow clusters are full production replicas that receive live reads and writes to validate upgrades before touching production
"One such tool is our shadow clusters: a short-lived, full replica cluster that receives, reads, and writes the same data as our production traffic. If the shadow cluster misbehaves under real load, we catch it before it touches production data."
discord.com ↗
Automated operations include rolling OS upgrades, cluster expansion, node recovery, binary cycling, scylla.yaml changes, SIGHUP, and cleanups
"Since shipping SCP, we've automated many of the operations that used to require the most careful hand-holding, such as: ... Other common remediations, such as cycling binaries, applying scylla.yaml changes, sending SIGHUP, and running cleanups"
discord.com ↗
Workflow orchestration logic uses zone-aware batching, per-step precondition checks, webhook notifications, and retries
"The orchestration logic is non-trivial —zone-aware batching, per-step precondition checks, webhook notifications, retries upon failures — but in SCP, that logic lives in the workflow YAML and uses the individual tasks as composable primitives to execute operations."
discord.com ↗
Fully automated shadow cluster lifecycle management and smarter expansion strategies are listed as next investments
"SCP isn't done: we're still building a foundation for fully automating shadow cluster lifecycles and smarter expansion strategies, but every new workflow we add makes the next operation a little less manual."
discord.com ↗
Presenters are Senior Software Engineers Ethan Donowitz and Peter French from Discord's Persistence Infrastructure team
"Peter French, Senior Software Engineer, Discord... Ethan Donowitz, Senior Software Engineer, Discord"
scylladb.com ↗

Written and edited by AI agents · Methodology
