A major U.S. financial institution hit repeated executor OOM failures while migrating Spark batch pipelines to Azure Kubernetes Service (AKS). A job processing a 3 GB fixed-width flat file cycled through 50 executor replacements before the platform team traced the failures to two interacting Kubernetes misconfigurations rather than to Spark heap tuning.

The pipelines were written in Scala and ran on Spark 3.4.0 on AKS. The first misconfiguration was `spark.kubernetes.local.dirs.tmpfs=true`, which backed all Spark local scratch directories with node RAM via tmpfs instead of disk. The `tmp-volume` and `workdir` volumes were capped at 1 Gi each. During shuffle-intensive stages, Spark spilled to what it treated as disk-backed scratch, but every spilled byte consumed node memory instead. Node RAM spiked from 42 GB to over 58 GB within seconds, leaving the 64 GB host with almost no headroom.

FIG. 02 Node RAM usage spikes from 42 GB to 58 GB within seconds during shuffle when tmpfs backing is enabled on a 64 GB node. — InfoQ Spark OOM post-mortem
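
A check for this first misconfiguration could be automated before a job is submitted. The following is a minimal sketch, not the team's tooling: the function name `find_scratch_risks`, the dictionary layout, and the 10 Gi floor (borrowed from the size the team eventually chose) are assumptions made for illustration.

```python
def find_scratch_risks(conf, scratch_limits_gi, min_scratch_gi=10):
    """Return warnings for risky Spark scratch settings.

    conf: Spark properties as a dict of strings.
    scratch_limits_gi: volume name -> size limit in Gi (hypothetical layout).
    min_scratch_gi: assumed floor; tune it against measured spill.
    """
    warnings = []
    # "true" makes Spark back its local directories with RAM (tmpfs).
    if conf.get("spark.kubernetes.local.dirs.tmpfs", "false").lower() == "true":
        warnings.append("local dirs are tmpfs-backed: shuffle spill consumes node RAM")
    for name, limit_gi in sorted(scratch_limits_gi.items()):
        if limit_gi < min_scratch_gi:
            warnings.append(f"{name}: {limit_gi} Gi is below the {min_scratch_gi} Gi floor")
    return warnings


# Settings described in the post-mortem before the fix.
failing = find_scratch_risks(
    {"spark.kubernetes.local.dirs.tmpfs": "true"},
    {"tmp-volume": 1, "workdir": 1},
)
for warning in failing:
    print(warning)
```

Run against the pre-fix settings, the sketch reports three warnings: one for the tmpfs flag and one for each 1 Gi volume.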

The second misconfiguration was a hard `required` podAffinity rule that pinned all four executors to the same 64 GB node. This concentrated the tmpfs-induced memory pressure onto a single host, turning a scratch-space bottleneck into a node-level OOM kill. Kubernetes consistently logged `OOMKilled` with exit code 137 as the kernel terminated executor processes.

FIG. 03 Pod affinity forced all four Spark executors onto a single node, concentrating memory pressure. Switching to preferred antiaffinity distributed executors across nodes. — InfoQ Spark OOM post-mortem
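
The difference between the two scheduling rules can be shown on the pod spec itself. The sketch below models the `affinity` stanza as plain Python dictionaries using the standard Kubernetes field names. The label selector (`spark-role: executor`), the `kubernetes.io/hostname` topology key, and the weight of 100 are assumptions, since the post-mortem does not publish the team's manifests, and `hard_colocation_terms` is a hypothetical helper.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"  # assumed topology key (one domain per node)


def hard_colocation_terms(pod_spec):
    """Return required podAffinity terms that force pods onto the same node."""
    affinity = pod_spec.get("affinity", {})
    required = affinity.get("podAffinity", {}).get(
        "requiredDuringSchedulingIgnoredDuringExecution", []
    )
    return [term for term in required if term.get("topologyKey") == HOSTNAME_KEY]


# Assumed selector matching executor pods.
selector = {"matchLabels": {"spark-role": "executor"}}

# Before: a hard rule that co-locates every executor on one node.
before = {"affinity": {"podAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
        {"labelSelector": selector, "topologyKey": HOSTNAME_KEY},
    ],
}}}

# After: a soft rule that asks the scheduler to spread executors out.
after = {"affinity": {"podAntiAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [
        {"weight": 100,
         "podAffinityTerm": {"labelSelector": selector, "topologyKey": HOSTNAME_KEY}},
    ],
}}}

print(len(hard_colocation_terms(before)))  # 1
print(len(hard_colocation_terms(after)))   # 0
```

Because the replacement rule is `preferred` rather than `required`, the scheduler may still place two executors on one node when no alternative exists, so node-level memory monitoring remains necessary.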

The InfoQ post-mortem notes that the team lost nearly a week to misdiagnosis. After smaller jobs completed cleanly, the large daily batch began failing. The team treated the initial OOMs as transient and restarted the job manually. When failures persisted, they raised `spark.executor.memory` from 8 GB to 10 GB and increased the executor count. Neither change helped, because the pressure was at the node level, not the heap level. Datadog's Kubernetes Node Overview dashboard then showed node memory climbing above 90 percent during shuffle stages, which redirected the investigation from Spark internals to infrastructure semantics.

The fix had three parts: setting `spark.kubernetes.local.dirs.tmpfs` to `false`, expanding both `tmp-volume` and `workdir` to 10 Gi of disk-backed storage, and replacing the hard `podAffinity` rule with a `preferred` `podAntiAffinity` rule to distribute executors across nodes. The team also pinned `spark.sql.shuffle.partitions` explicitly at 200.
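
The Spark-side portion of that fix can be expressed as two properties. The sketch below renders them as `spark-submit` flags. The helper `to_submit_args` is hypothetical. The volume sizes and the anti-affinity rule are left out because they live in the pod spec, whose layout the post-mortem does not describe.

```python
def to_submit_args(conf):
    """Render Spark properties as a flat list of spark-submit --conf flags."""
    args = []
    for key in sorted(conf):  # sorted for a stable, reviewable order
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Spark properties named in the post-mortem's fix.
fixed_conf = {
    "spark.kubernetes.local.dirs.tmpfs": "false",  # scratch back on disk
    "spark.sql.shuffle.partitions": "200",         # pinned explicitly
}

print(" ".join(to_submit_args(fixed_conf)))
# --conf spark.kubernetes.local.dirs.tmpfs=false --conf spark.sql.shuffle.partitions=200
```

Keeping these settings in a reviewed dictionary rather than in ad hoc command lines makes it easier to see when a migration silently changes one of them.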

The incident highlights a subtle infrastructure contract break that can accompany cloud migrations: the on-premises clusters had handled the same workload for three years because disk-backed scratch and distributed executor placement were implicit defaults. On AKS, the team had to explicitly validate that local directories were actually backed by disk and that scheduling rules did not co-locate shuffle-heavy executors. A 1 Gi scratch volume was far too small for this job: the 3 GB source file required multiple parsing passes and unions that amplified intermediate data far beyond the raw footprint. The team now sizes scratch volumes against measured shuffle spill profiles rather than input data volume.
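
Sizing from measured spill rather than input volume can be reduced to a small calculation. The following is a sketch under stated assumptions: the per-executor spill figures are invented for illustration, the 1.5x headroom factor is an arbitrary safety margin, and `scratch_size_gi` is a hypothetical helper rather than the team's method.

```python
import math


def scratch_size_gi(peak_spill_bytes, headroom=1.5):
    """Suggest a scratch volume size in whole Gi from measured spill peaks.

    peak_spill_bytes: peak spill per executor, in bytes, from past runs.
    headroom: assumed safety multiplier on the worst observed peak.
    """
    worst_gib = max(peak_spill_bytes) / 2**30
    return math.ceil(worst_gib * headroom)


# Hypothetical measurements; real values would come from Spark stage metrics.
samples = [int(4.2 * 2**30), int(6.1 * 2**30), int(5.5 * 2**30)]
print(scratch_size_gi(samples))  # 10
```

In this invented example the worst peak of about 6.1 GiB is roughly twice the 3 GB input, which shows why input size alone is a poor basis for the volume limit.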

The broader risk is that Kubernetes schedulers and Spark configuration layers are often maintained by different teams, and neither group owns the intersection where storage semantics meet placement policy. Architects should treat any hard co-location rule on Spark executors as a deliberate risk decision, validate tmpfs assumptions after every infrastructure migration, and instrument for node-level memory saturation before the first production shuffle stage.
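
The node-level instrumentation can start as a simple saturation check. The sketch below applies a 90 percent alert threshold, matching the level the Datadog dashboard surfaced, to the figures reported for the 64 GB node. The function name and the returned fields are assumptions, and a real deployment would read usage from its monitoring system rather than from constants.

```python
def node_memory_status(used_gb, capacity_gb, alert_ratio=0.90):
    """Summarize node memory pressure against an alert threshold."""
    ratio = used_gb / capacity_gb
    return {
        "ratio": round(ratio, 3),
        "headroom_gb": capacity_gb - used_gb,
        "saturated": ratio >= alert_ratio,
    }


# Figures from the post-mortem: a 64 GB node before and during shuffle.
print(node_memory_status(42, 64))  # saturated: False
print(node_memory_status(58, 64))  # saturated: True, 6 GB of headroom left
```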

Written and edited by AI agents · Methodology