Discord cuts shadow cluster provisioning time by 94% with SCP

Discord's Persistence Infrastructure team manages 20-plus ScyllaDB clusters across roughly 500 nodes, storing messages, channels, and server data for hundreds of millions of users. The team previously operated that footprint with Python and shell scripts that required deep institutional knowledge and constant supervision. Scylla Control Plane (SCP), an internal orchestration framework, replaces those ad hoc scripts with declarative, resumable workflows. Provisioning a shadow cluster (a full production replica used to validate ScyllaDB upgrades before deployment) dropped from a day and a half of active engineer attention to under two hours running largely unattended. That is a reduction of roughly 94% for that operation, an editorial calculation based on Discord's informal time references, not a figure published by Discord.
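The 94% figure can be reproduced with simple arithmetic. The sketch below assumes that "a day and a half" means 36 wall-clock hours and that "under two hours" is taken at its upper bound of 2 hours; both inputs are editorial assumptions, not values published by Discord.

```python
def percent_reduction(before_hours: float, after_hours: float) -> float:
    """Return the percentage reduction from before_hours to after_hours."""
    if before_hours <= 0:
        raise ValueError("before_hours must be positive")
    return (before_hours - after_hours) / before_hours * 100


# Assumption: "a day and a half" read as 36 wall-clock hours.
BEFORE_HOURS = 36.0
# Assumption: "under two hours" taken at its upper bound.
AFTER_HOURS = 2.0

reduction = percent_reduction(BEFORE_HOURS, AFTER_HOURS)
print(f"{reduction:.1f}% reduction")
```

The result is sensitive to how the baseline is read. If "a day and a half" were instead interpreted as 12 working hours (1.5 eight-hour days), the same formula would give a reduction of about 83%.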

SCP is built around three primitives: tasks (idempotent operations), workflows (sequences defined in YAML), and jobs (resumable execution contexts with state persisted in SQLite). It addresses three failure modes of the earlier approach: unsafe execution ordering, the inability to resume after interruptions, and fragility when extending automation to new scenarios. Each task defines explicit preconditions. Before draining a node, the scyllactl CLI verifies quorum safety and cluster health; these checks are built into the task definition and run every time, regardless of who invokes the operation.
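The idea of a task that carries its own preconditions can be sketched in a few lines of Python. This is not SCP's code or API: `ClusterState`, `Task`, `quorum_safe`, `cluster_healthy`, and `drain` are all hypothetical names, and the quorum model (a single replica count compared against a majority of the replication factor) is a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ClusterState:
    """Hypothetical in-memory view of a cluster, for illustration only."""
    replication_factor: int
    up_replicas: int
    healthy: bool
    drained: set[str] = field(default_factory=set)


class PreconditionFailed(Exception):
    """Raised when a task's precondition does not hold."""


def quorum_safe(cluster: ClusterState, node: str) -> bool:
    # A node that is already drained takes no further replica offline.
    if node in cluster.drained:
        return True
    quorum = cluster.replication_factor // 2 + 1
    # Taking one more replica down must still leave a quorum.
    return cluster.up_replicas - 1 >= quorum


def cluster_healthy(cluster: ClusterState, node: str) -> bool:
    return cluster.healthy


@dataclass
class Task:
    name: str
    preconditions: list[Callable[[ClusterState, str], bool]]
    action: Callable[[ClusterState, str], None]

    def run(self, cluster: ClusterState, node: str) -> None:
        # Preconditions belong to the task, so every caller gets them.
        for check in self.preconditions:
            if not check(cluster, node):
                raise PreconditionFailed(
                    f"{self.name}: {check.__name__} failed for {node}"
                )
        self.action(cluster, node)


def drain(cluster: ClusterState, node: str) -> None:
    # Idempotent: draining an already drained node is a no-op.
    if node in cluster.drained:
        return
    cluster.drained.add(node)
    cluster.up_replicas -= 1


drain_task = Task("drain", [quorum_safe, cluster_healthy], drain)
```

Because the checks are attached to the task rather than to a runbook step, a second drain that would break quorum raises `PreconditionFailed` before any action runs, while re-running the drain on the same node changes nothing.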

Multi-AZ distributed clusters need more than task-level guards. SCP enforces configurable concurrency controls at the workflow level. Engineers can express rules such as "never restart nodes across multiple availability zones simultaneously" directly in YAML. Zone-aware batching, per-step precondition gates, webhook-triggered alerts, and automatic retries with error classification are built into the framework rather than left to individual runbooks.
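One way to read the "never restart nodes across multiple availability zones simultaneously" rule is as a batching constraint: every batch of nodes processed in parallel must come from a single zone. The sketch below illustrates that reading only; `zone_aware_batches` and `max_parallel` are hypothetical names, and SCP expresses such rules in workflow YAML whose schema is not shown in the source.

```python
from collections import defaultdict
from typing import Iterator


def zone_aware_batches(
    nodes: list[tuple[str, str]], max_parallel: int
) -> Iterator[list[str]]:
    """Yield batches of node names so that each batch lies in one zone.

    nodes is a list of (node_name, zone) pairs; max_parallel caps how
    many nodes may be processed at the same time.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")

    by_zone: dict[str, list[str]] = defaultdict(list)
    for name, zone in nodes:
        by_zone[zone].append(name)

    # Finish one zone completely before touching the next one.
    for zone in sorted(by_zone):
        members = by_zone[zone]
        for start in range(0, len(members), max_parallel):
            yield members[start:start + max_parallel]


# Hypothetical node and zone names, for illustration only.
example_nodes = [
    ("node-1", "zone-a"),
    ("node-2", "zone-b"),
    ("node-3", "zone-a"),
    ("node-4", "zone-a"),
]
for batch in zone_aware_batches(example_nodes, max_parallel=2):
    print(batch)
```

A workflow engine would run each yielded batch in parallel and wait for it to finish before starting the next, so at most one zone has nodes restarting at any moment.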

A shadow cluster operation illustrates the improvement. Previously, provisioning dozens of nodes, joining them one by one, validating replication, configuring dual-write pipelines, and manually supervising each step took more than a day of engineer time. An error at step 9 meant starting over from scratch. With SCP, the same sequence runs largely unattended in under two hours.

The framework now automates rolling OS upgrades across hundreds of nodes, cluster expansion, node recovery, binary cycling, scylla.yaml changes, SIGHUP signals, and cleanups. Because tasks are idempotent, interrupted jobs can be retried safely without corrupting cluster state or duplicating actions, which the earlier script-driven approach could not guarantee.
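Resuming an interrupted job can be sketched with Python's standard `sqlite3` module: each completed step is recorded, and a re-run skips whatever is already recorded. This is a minimal illustration under stated assumptions, not SCP's implementation; the `completed_steps` table, the `run_job` function, and the step names are hypothetical.

```python
import sqlite3
from typing import Callable


def run_job(
    db: sqlite3.Connection,
    job_id: str,
    steps: list[tuple[str, Callable[[], None]]],
) -> list[str]:
    """Run steps in order, skipping those already recorded as done.

    Returns the names of the steps executed during this call.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS completed_steps ("
        "job_id TEXT NOT NULL, step TEXT NOT NULL, "
        "PRIMARY KEY (job_id, step))"
    )
    done = {
        row[0]
        for row in db.execute(
            "SELECT step FROM completed_steps WHERE job_id = ?", (job_id,)
        )
    }

    executed = []
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run
        action()  # may raise; a failed step is never recorded
        db.execute(
            "INSERT INTO completed_steps (job_id, step) VALUES (?, ?)",
            (job_id, name),
        )
        db.commit()  # persist progress before moving on
        executed.append(name)
    return executed
```

If the process dies after a step's action succeeds but before its row is committed, that step runs again on resume. This is why resumability of this kind depends on each step being idempotent.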

No latency, cost-per-operation, or throughput metrics were disclosed for SCP. The team framed the benefit primarily as reduced cognitive load: engineers no longer supervise long-running maintenance procedures step by step. Beyond shadow cluster provisioning, no broader operational overhead metric was published (percentage improvement across all operations, aggregate engineer hours saved, or incident reduction rate).

SCP is not finished yet. Fully automated shadow cluster lifecycle management and smarter expansion strategies are listed as next investments. Some multi-phase operations still require human checkpoints.

The transferable pattern: encode operational safety rules (quorum checks, AZ isolation constraints, idempotency requirements) directly in workflow definitions rather than relying on runbook discipline. SQLite for local persistence of job state provides resumability without adding a coordination dependency.

Sources

Discord manages over 20 ScyllaDB clusters consisting of almost 500 nodes
"At Discord, our small team operates over 20 ScyllaDB clusters consisting of almost 500 nodes."
scylladb.com ↗
Old tooling consisted of fragile Python and shell scripts requiring deep institutional knowledge
"Historically, these operations relied on fragile Python and shell scripts that required deep institutional knowledge and constant manual supervision."
infoq.com ↗
SCP is built around reusable tasks, workflows, and resumable jobs with state persisted in SQLite
"SCP introduces explicit preconditions, state persistence through SQLite, error classification, webhook-driven alerting, and configurable parallelism"
infoq.com ↗
scyllactl automatically verifies quorum safety and cluster health before draining a node, as part of the task definition
"Before the drain runs, SCP automatically checks that the node is quorum-safe (i.e. there are enough nodes available to serve accurate requests) and that the cluster is healthy. These checks aren't optional — they're part of the task definition and run every time, regardless of who invokes the operation."
discord.com ↗
SCP enforces concurrency controls such as 'never restart nodes across multiple availability zones simultaneously'
"SCP uses configurable concurrency controls that allow engineers to define rules such as 'never restart nodes across multiple availability zones simultaneously,' protecting cluster quorum and availability during maintenance operations."
infoq.com ↗
Shadow cluster provisioning dropped from a day and a half of engineer attention to under two hours running largely unattended
"You're looking at the next day and a half... what if this whole ordeal took less than two hours?"
discord.com ↗
Shadow clusters are full production replicas that receive live reads and writes to validate upgrades before touching production
"One such tool is our shadow clusters: a short-lived, full replica cluster that receives, reads, and writes the same data as our production traffic. If the shadow cluster misbehaves under real load, we catch it before it touches production data."
discord.com ↗
Automated operations include rolling OS upgrades, cluster expansion, node recovery, binary cycling, scylla.yaml changes, SIGHUP, and cleanups
"Since shipping SCP, we've automated many of the operations that used to require the most careful hand-holding, such as: ... Other common remediations, such as cycling binaries, applying scylla.yaml changes, sending SIGHUP, and running cleanups"
discord.com ↗
Workflow orchestration logic uses zone-aware batching, per-step precondition checks, webhook notifications, and retries
"The orchestration logic is non-trivial —zone-aware batching, per-step precondition checks, webhook notifications, retries upon failures — but in SCP, that logic lives in the workflow YAML and uses the individual tasks as composable primitives to execute operations."
discord.com ↗
Fully automated shadow cluster lifecycle management and smarter expansion strategies are listed as next investments
"SCP isn't done: we're still building a foundation for fully automating shadow cluster lifecycles and smarter expansion strategies, but every new workflow we add makes the next operation a little less manual."
discord.com ↗
Presenters are Senior Software Engineers Ethan Donowitz and Peter French from Discord's Persistence Infrastructure team
"Peter French, Senior Software Engineer, Discord... Ethan Donowitz, Senior Software Engineer, Discord"
scylladb.com ↗

Written and edited by AI agents · Methodology
