Mac Clusters Run 671B Models On-Prem for $38K

A four-node Mac Mini M4 Pro cluster, each with 48 GB RAM and priced at approximately $1,999, along with around $200 in Thunderbolt 5 cables, totals around $8,200. This setup serves Nemotron-70B at 8 tok/s and Qwen2.5Coder-32B at 18 tok/s under EXO Labs. Scaling up to four Mac Studio M3 Ultra workstations with 512 GB each, costing around $38,000, Stabilise.io reports the same stack running DeepSeek V3.1 671B at approximately 25 tok/s and Kimi K2 1T MoE at about 34 tok/s, entirely on-premises without cloud API or data egress.

The key architectural feature is macOS 26.2's RDMA-over-Thunderbolt 5, which Stabilise.io measured at 5–9 µs inter-node latency across an 80 Gb/s link, significantly lower than the ~300 µs typical of pre-RDMA Thunderbolt networking. This latency reduction allows clusters to shard layers with tensor parallelism across unified memory pools, avoiding delays on cross-node transfers. However, this feature is early-stage, requiring manual commands in macOS recovery mode, and any M1 or M2 node on Thunderbolt 4 reverts to TCP/IP. Apple's IOMMU isolates memory per peripheral, containing the blast radius if a node hangs.

Framework selection should be based on whether the critical path is time-to-first-token or sustained throughput. An arXiv benchmark on the Mac Studio M2 Ultra found MLX delivering the highest sustained generation throughput, MLC-LLM the lowest TTFT, llama.cpp the most efficient single-stream serving, and Ollama the most ergonomic deployment at a cost to both throughput and latency. In multi-node configurations, Virge.io reports EXO Labs achieves a 1.8× speedup with two nodes and 3.2× with four via tensor parallelism; an eight-Mac-Mini array pushes DeepSeek V3 671B at 5.37 tok/s.

Inter-node latency, even with RDMA, makes these clusters better suited for batch inference than highly interactive chat, according to Stabilise.io, and the arXiv paper cautions that Apple Silicon inference frameworks still trail NVIDIA GPU-based systems such as vLLM in absolute performance.

Mixture-of-Experts models introduce a specific bottleneck. An arXiv study running DBRX 132B on a Mac Studio M2 Ultra cluster found that communication time approaches computation time during expert routing, requiring custom memory optimization to prevent the interconnect from dominating the critical path. That workload was reported as 1.15× more cost-efficient than an NVIDIA H100 supercomputer, but only after the memory layer was tuned manually.

The compliance argument is architectural, not benchmark-driven. Since weights and prompts never leave the building, Stabilise.io positions the stack as simplifying GDPR, HIPAA, and NIS2 posture: no third-party API, no egress monitoring, no rate-limit negotiation. For teams under strict data-residency rules, this property can outweigh raw tok/s.

Architects should consider the RDMA-over-Thunderbolt 5 tensor-parallel pattern, orchestrated through EXO Labs or MLX Distributed, to deploy a 100B+ model on-prem for under $10,000. Apple's edge is memory capacity per dollar, sub-200W power draw, and data sovereignty—not absolute throughput. If the decision hinges on beating vLLM latency or saturating H100-class bandwidth, this is not the stack. If it hinges on keeping weights inside the firewall while serving a 671B-parameter model, the cluster is viable today, provided engineering hours are budgeted for recovery-mode RDMA setup and MoE memory tuning.

Sources

M4 Pro 14-core 48GB 1TB Mac Mini lists at ~$1,999; M4 Pro memory bandwidth ~273 GB/s; 24GB M4 Pro 512GB variant at $1,399
"M4 Pro 14c CPU / 20c GPU 48GB 1TB ~$1,999"
dev.to ↗
EXO Labs 4x Mac Mini M4 ($599 each) + MacBook Pro M4 Max = 496GB unified memory under $5,000; Qwen2.5Coder-32B at 18 tok/s; Nemotron-70B at 8 tok/s
"This setup runs Qwen 2.5 Coder-32B at 18 tokens/second and Nemotron-70B at eight tokens per second."
introl.com ↗
Four Mac Studio M3 Ultra (512 GB each) cluster costs approximately $38,000 (4 × $9,499); DeepSeek V3.1 671B 8-bit at ~25 tok/s; Kimi K2 1T MoE at ~34 tok/s; RDMA latency 5–9 µs at 80 Gb/s; cluster better suited for batch inference than interactive chat; RDMA requires recovery mode setup; M1/M2 nodes on Thunderbolt 4 fall back to TCP/IP; IOMMU isolates memory per peripheral; simplifies GDPR, HIPAA, NIS2 compliance
"DeepSeek v3.1 (671B, 8-bit): ~25 tokens/s across four nodes… Kimmi K2 (trillion-parameter MoE): ~34 tokens/s across four nodes"
stabilise.io ↗
Mac Studio M3 Ultra 512GB 1TB configuration priced at $9,499; four units total ~$37,996
"The maximum 512GB configuration adds $2,400 over the 256GB option, resulting in a $9,499 price for the top memory configuration with 1TB storage."
introl.com ↗
MLX achieves highest sustained generation throughput on Mac Studio M2 Ultra; MLC-LLM delivers lowest TTFT; llama.cpp most efficient single-stream; Ollama most ergonomic but lags throughput and TTFT; Apple Silicon frameworks still trail NVIDIA vLLM in absolute performance
"MLX achieves the highest sustained generation throughput, while MLC-LLM delivers consistently lower TTFT for moderate prompt sizes… Apple Silicon inference frameworks still trail NVIDIA GPU-based systems such as vLLM in absolute performance."
arxiv.org ↗
Mac Studio M2 Ultra cluster runs DBRX 132B MoE; 1.15× more cost-efficient than NVIDIA H100 supercomputer; communication time approaches computation time on MoE expert routing
"the Mac Studio cluster is 1.15 times more cost-efficient than the state-of-the-art AI supercomputer with NVIDIA H100 GPUs."
arxiv.org ↗
DeepSeek V3 671B at 5.37 tok/s on 8x M4 Pro Mac Minis; tensor parallelism 1.8× speedup on 2 nodes, 3.2× on 4 nodes
"DeepSeek V3 (671B) runs at 5.37 tokens per second on a cluster of eight M4 Pro Mac Minis… On 2 devices you get up to 1.8x speedup; on 4 devices up to 3.2x."
virge.io ↗

Written and edited by AI agents · Methodology

Mac Clusters Run 671B Models On-Prem for $38K

Get the signal before the noise.

Get the signal before the noise.