Shopify staff engineer Paulo Arruda cut a 22-hour manual theme-review process to 7–20 minutes by replacing a monolithic LLM prompt with a swarm of specialized Claude Code agents. The work is documented in his QCon AI presentation published on InfoQ.

Shopify held contracts with all major AI providers by 2024 but used fragmented tooling: LibreChat, VSCode Copilot, and Cursor. The failure mode was clear. Teams paired single LLMs with massive, multi-concern system prompts. Models produced erratic output as they struggled to hold too many unrelated instructions in context simultaneously. Arruda's fix was decomposition: map each distinct task to a lean, single-responsibility agent, then orchestrate them.
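The decomposition pattern can be sketched in plain Ruby. This is a hypothetical illustration, not Claude Swarm's or SwarmSDK's actual API: `Agent` and `Orchestrator` are invented names, and the `run` method stands in for a real model call. The point is structural: each agent carries one lean, single-concern prompt, and an orchestrator fans the task out rather than stuffing every instruction into one system prompt.

```ruby
# Hypothetical sketch of single-responsibility agents (not SwarmSDK's API).
# Each agent owns exactly one concern and a short prompt.
Agent = Struct.new(:name, :prompt) do
  # Stand-in for an LLM call: a real agent would send only its own
  # lean prompt plus the task to the model.
  def run(task)
    { agent: name, input: task, instructions: prompt }
  end
end

class Orchestrator
  def initialize(agents)
    @agents = agents
  end

  # Fan the same task out to every specialized agent and collect results,
  # instead of asking one model to juggle all concerns at once.
  def review(task)
    @agents.map { |a| a.run(task) }
  end
end

agents = [
  Agent.new("accessibility", "Check the theme against contrast rules only."),
  Agent.new("performance",   "Flag render-blocking assets only."),
]
results = Orchestrator.new(agents).review("theme-123")
```

Each agent's context stays small and unrelated instructions never share a prompt, which is the property the monolithic approach lacked.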

Arruda built Claude Swarm to automate agent handoffs after manually shuttling code between two Claude Code windows during a hack day. The project now has 1,400-plus GitHub stars. A successor framework, SwarmSDK, is written in Ruby.

FIG. 02 Evolution from manual Claude Code shuttling to orchestrated multi-agent swarm architecture.

The first large-scale deployment targeted Shopify's theme review pipeline. Previously, human reviewers worked through a checklist of criteria; a prior LLM assist took them halfway there but left 22 hours of work. Breaking each review criterion into a dedicated agent cut that to 7–20 minutes. A second case—internal candidate role assessments—collapsed from hours to under an hour. A third deployed 15 parallel Claude Code instances to mine internal documentation and reconstruct what the company shipped in a given quarter.

Arruda cites automation speedups of 65x to 190x across these deployments. The variance matters. Gains are largest when the baseline was human-paced sequential review and smallest when the original task was already semi-automated. Engineering teams should expect the high end only when the bottleneck is human throughput, not LLM latency.
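The quoted range follows directly from the theme-review numbers: 22 hours of manual work divided by the 7–20 minute swarm runtime. A quick check:

```ruby
# 22 hours of manual review vs. a 7-20 minute swarm run.
baseline_minutes = 22 * 60            # 1320 minutes of human work
low  = baseline_minutes / 20.0        # slowest swarm run: 66x  (~65x)
high = baseline_minutes / 7.0         # fastest swarm run: ~189x (~190x)
```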

FIG. 03 Speedup factors across Shopify's three multi-agent deployments, showing variance driven by task complexity and context size. — Paulo Arruda, InfoQ QCon Presentation 2024

For enterprise architects, the Shopify pattern has three concrete implications. First, context window size is not a substitute for prompt architecture. Even with 200K-token models, cramming multi-domain logic into a single prompt produces worse results than task decomposition. Second, the microservices analogy holds: each agent should have a clear interface, observable inputs and outputs, and failure modes that don't cascade. Arruda's theme-review swarm isolated each review criterion to an independent agent partly to contain failure blast radius. Third, open-source orchestration tooling—Claude Swarm, SwarmSDK, Shopify's separately released Roast framework—has matured enough for internal adoption without building orchestration infrastructure from scratch.
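The second implication, the microservices analogy, can be made concrete with a minimal sketch. Everything here is invented for illustration (`CriterionAgent` is not a Shopify or SwarmSDK class): the idea is that each agent exposes one clear call interface and returns failures as data, so one broken criterion check cannot cascade and sink the whole review.

```ruby
# Hypothetical sketch of failure containment per review criterion.
class CriterionAgent
  def initialize(name, &check)
    @name  = name
    @check = check
  end

  # Observable input and output: a theme goes in, a structured report
  # comes out, whether the check succeeded or blew up.
  def call(theme)
    { agent: @name, status: :ok, finding: @check.call(theme) }
  rescue => e
    # Contain the blast radius: report the error instead of raising it
    # into the orchestrator and taking sibling agents down with it.
    { agent: @name, status: :error, finding: e.message }
  end
end

agents = [
  CriterionAgent.new("liquid-syntax") { |t| "no issues in #{t}" },
  CriterionAgent.new("broken-check")  { |_| raise "upstream timeout" },
]
report = agents.map { |a| a.call("theme-123") }
```

The healthy agent still reports while the failed one degrades into an error record, mirroring how the theme-review swarm isolated criteria to independent agents.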

The unresolved challenge is context bloat. As swarm deployments grow—15 agents is not the ceiling—managing what each agent knows and preventing cross-agent context pollution becomes the binding constraint. Arruda's working hypothesis is filesystem-based adapters that give each agent a scoped, persistent memory store rather than relying on in-context state. Whether this approach scales to production will determine whether swarm architectures remain an advanced technique or become routine deployment.
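Arruda's filesystem hypothesis, as described, amounts to giving each agent a disjoint on-disk namespace. The sketch below is a speculative reading of that idea, not his implementation; `ScopedMemory` and its methods are invented names. The property it demonstrates is the one the hypothesis targets: one agent's notes can never pollute another agent's context, because the scopes are disjoint by construction.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical filesystem-backed memory adapter: each agent gets its own
# subdirectory and can only read and write inside it.
class ScopedMemory
  def initialize(root, agent_name)
    @dir = File.join(root, agent_name)  # scope = one directory per agent
    FileUtils.mkdir_p(@dir)
  end

  def write(key, value)
    File.write(File.join(@dir, key), value)
  end

  def read(key)
    path = File.join(@dir, key)
    File.exist?(path) ? File.read(path) : nil
  end
end

root = Dir.mktmpdir
a = ScopedMemory.new(root, "accessibility")
b = ScopedMemory.new(root, "performance")
a.write("notes", "contrast fails on header")
a.read("notes")  # the writing agent reads its note back
b.read("notes")  # => nil: the other agent's scope is empty
```

Persistent state lives on disk rather than in the context window, which is what would let swarms grow past 15 agents without each agent's prompt bloating with shared history.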
