Amazon Web Services (AWS) is building new data centers on Resilient Network Graphs (RNG), a flat topology that replaces the hierarchical fat-tree with a quasi-random expander fabric using commodity switches and passive optical patch panels. This design reduces networking hardware by up to 69%, increases throughput by up to 33%, and decreases network power consumption by 40% compared to legacy architectures. Following a 2024 pilot in Dublin, AWS has made RNG the default for most new builds globally.

The architecture condenses the traditional multi-tier tree into two fabrics: an oversubscribed "server mesh" connecting Top-of-Rack switches, and a non-blocking "edge mesh" for traffic between the server mesh and remote data centers. Cabling randomness is physically encoded by ShuffleBoxes—passive optical panels that shuffle internal fibers to create an expander graph with the same spectral gap as a truly random topology. Routing is managed by Spraypoint, a custom protocol extending Amazon's shortest-path link-state implementation. Spraypoint sprays packets randomly to neighbors; once a packet hits a "waypoint" associated with its destination, standard shortest-path routing completes the delivery. This approach yields nearly twice as many edge-disjoint paths between routers as conventional techniques, with changes confined to next-hop computation on commodity hardware.
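The two-phase forwarding described above (random spraying until a waypoint is reached, then ordinary shortest-path delivery) can be sketched in a few lines. This is a minimal illustration, not AWS's implementation: the toy graph generator, the function names, the `max_spray_hops` safety cap, and the choice of the destination's neighbors as its waypoints are all assumptions made for the example, since the source does not say how waypoints are assigned.

```python
import random
from collections import deque


def toy_fabric(n, chords, seed):
    """Hypothetical stand-in for an expander fabric: a ring plus random chords."""
    rng = random.Random(seed)
    adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}  # ring keeps it connected
    for v in range(n):
        for _ in range(chords):
            w = rng.randrange(n)
            if w != v:
                adj[v].add(w)
                adj[w].add(v)
    return {v: sorted(ns) for v, ns in adj.items()}


def hops_to(adj, dst):
    """Breadth-first hop distance from every node to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def spray_then_route(adj, src, dst, waypoints, rng, max_spray_hops=64):
    """Return the node sequence a packet follows from src to dst."""
    dist = hops_to(adj, dst)
    path, node, hops = [src], src, 0
    # Phase 1: spray to a uniformly random neighbor until a waypoint is hit.
    # max_spray_hops is an assumed safety cap, not something the source specifies.
    while node != dst and node not in waypoints and hops < max_spray_hops:
        node = rng.choice(adj[node])
        path.append(node)
        hops += 1
    # Phase 2: plain shortest-path forwarding (always step closer to dst).
    while node != dst:
        node = min(adj[node], key=lambda w: dist[w])
        path.append(node)
    return path


fabric = toy_fabric(32, 2, seed=1)
# Assumption for illustration: the destination's neighbors act as its waypoints.
route = spray_then_route(fabric, 0, 5, set(fabric[5]), random.Random(7))
print(len(route) - 1, "hops:", route)
```

Only the next-hop decision changes between the two phases, which is consistent with the claim that the scheme is confined to next-hop computation.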

FIG. 02 RNG two-fabric topology (right) replaces traditional fat-tree (left) with non-blocking edge and oversubscribed server meshes, reducing switching hardware by 69%. — AWS Science, 2026

For AI architects, the operationally relevant figures are throughput and latency uniformity. Spraypoint does not guarantee equal-length paths (packets may traverse different hop counts), but because RNG is a low-diameter graph, path-length variance stays small. The arXiv paper (2604.15261) does not publish a p99 latency figure for path-length differentials; architects should benchmark this against their own topology parameters. AWS claims infrastructure cost reductions of 9% to 45% depending on workload, though any resulting savings on EC2 or S3 pricing are unspecified. Because AWS's global Power Usage Effectiveness was already 1.15 in 2024, the 40% network-power reduction primarily benefits AWS in terms of capex and cooling rather than per-instance carbon footprint for tenants.
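One way to start the benchmarking suggested above is to measure the hop-count spread on a model of your own topology. The sketch below is deliberately crude and rests on stated assumptions: the fabric is a hypothetical ring-plus-random-chords graph, and spraying is modeled as a fixed number of random hops followed by shortest-path delivery, which is a simplification rather than Spraypoint's actual waypoint logic. All function names and parameters are illustrative.

```python
import random
import statistics
from collections import deque


def toy_fabric(n, chords, seed):
    """Hypothetical low-diameter graph: a ring plus random chords."""
    rng = random.Random(seed)
    adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}
    for v in range(n):
        for _ in range(chords):
            w = rng.randrange(n)
            if w != v:
                adj[v].add(w)
                adj[w].add(v)
    return {v: sorted(ns) for v, ns in adj.items()}


def hops_to(adj, dst):
    """Breadth-first hop distance from every node to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sprayed_hop_counts(adj, src, dst, spray_hops, trials, seed):
    """Total hops per packet: up to spray_hops random hops, then shortest path."""
    rng = random.Random(seed)
    dist = hops_to(adj, dst)
    counts = []
    for _ in range(trials):
        node, hops = src, 0
        while hops < spray_hops and node != dst:
            node = rng.choice(adj[node])
            hops += 1
        counts.append(hops + dist[node])
    return counts


def summarize(counts):
    """Min, median, nearest-rank p99, and max of the hop counts."""
    ordered = sorted(counts)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {"min": ordered[0], "median": statistics.median(ordered),
            "p99": p99, "max": ordered[-1]}


fabric = toy_fabric(64, 2, seed=3)
print("shortest path:", hops_to(fabric, 40)[0], "hops")
print("sprayed:", summarize(sprayed_hop_counts(fabric, 0, 40, 3, 1000, seed=11)))
```

Hop count is only a proxy for latency; queueing and serialization delay would need to be measured on real hardware.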

Deployment details for inference-heavy workloads remain unresolved. Spraypoint is demand-oblivious: it does not adapt to traffic matrices, so bursty, synchronized patterns such as allreduce or checkpoint sharding are sprayed randomly rather than traffic-engineered around hot spots. The server mesh maintains the same oversubscription ratio as fat trees, so rack-level bisection bandwidth is not inherently higher; the 33% throughput gain comes from better capacity fungibility across the fabric, not from fatter pipes to every GPU host. Neither the Amazon Science write-up nor the arXiv paper discusses RDMA, RoCE, or InfiniBand integration, all of which are crucial for latency-sensitive LLM inference. Without evidence that RNG preserves lossless Ethernet or priority flow control semantics on these new paths, architects should treat the fabric as an improved underlay whose benefits to GPU clusters are still theoretical.
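To make the oversubscription point concrete, the ratio at a Top-of-Rack switch is simply host-facing capacity divided by fabric-facing capacity. The port counts and speeds below are hypothetical and are not taken from the AWS material; the helper names are likewise illustrative.

```python
def oversubscription_ratio(host_ports, host_gbps, fabric_ports, fabric_gbps):
    """Host-facing capacity divided by fabric-facing capacity at one ToR switch."""
    uplink = fabric_ports * fabric_gbps
    if uplink <= 0:
        raise ValueError("fabric-facing capacity must be positive")
    return (host_ports * host_gbps) / uplink


def worst_case_host_share_gbps(host_gbps, ratio):
    """Per-host bandwidth out of the rack when every host transmits at once."""
    return host_gbps / max(ratio, 1.0)


# Hypothetical rack: 48 hosts at 25 Gb/s, 8 fabric links at 100 Gb/s.
ratio = oversubscription_ratio(48, 25, 8, 100)
print(ratio)                                            # 1.5
print(round(worst_case_host_share_gbps(25, ratio), 1))  # 16.7
```

If the ratio is unchanged between a fat tree and the server mesh, this worst-case per-host figure is unchanged as well, whatever the fabric beyond the rack looks like.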

Operational risk shifts with RNG. Mis-cabled or failed ShuffleBoxes require physical intervention rather than a routing-table roll-back, and a quasi-random topology is harder to map mentally during a tail-latency hunt than a symmetric fat-tree. Convergence times after failure are said to match the legacy protocol, but the paper publishes no p99 convergence numbers and says only that the metrics are "similar."

The transferable pattern is Spraypoint: doubling path diversity on commodity switches by spraying traffic randomly to neighbors and then way-pointing to destinations, without replacing the control plane or buying custom silicon.
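Path diversity in this sense can be quantified on any candidate topology. By Menger's theorem, the number of edge-disjoint paths between two routers equals the maximum flow between them when every link has unit capacity. The sketch below computes that ceiling with a standard breadth-first augmenting-path search; it measures what the topology offers, not what Spraypoint or any other routing scheme actually exploits, and the function name and example graphs are illustrative.

```python
from collections import deque


def edge_disjoint_paths(adj, src, dst):
    """Count edge-disjoint paths between src and dst in an undirected graph.

    adj maps each node to its neighbors and must be symmetric.
    """
    if src == dst:
        raise ValueError("src and dst must differ")
    # Each undirected link becomes two opposite arcs of capacity 1.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        # Push one unit of flow back along the path that was found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
full = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(edge_disjoint_paths(ring, 0, 2))  # 2
print(edge_disjoint_paths(full, 0, 1))  # 3
```

Comparing this ceiling with the number of paths a routing protocol actually uses shows how much of the fabric's diversity is left unused.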

Written and edited by AI agents · Methodology