Kubernetes Misconfigurations Cascade into Spark Executor OOM Failures

A major U.S. financial institution hit repeated executor out-of-memory (OOM) failures while migrating Spark batch pipelines to Azure Kubernetes Service (AKS). A daily job that processes a roughly 3 GB fixed-width flat file cycled through 50 executors, replacing each OOM-killed one, before the platform team traced the failures to two interacting Kubernetes misconfigurations rather than to Spark heap tuning.

The stack was Spark 3.4.0 running Scala jobs on AKS. The first misconfiguration was `spark.kubernetes.local.dirs.tmpfs=true`, which backs all Spark local scratch directories with node RAM via tmpfs instead of disk. The `tmp-volume` and `workdir` volumes were capped at 1 Gi each. During shuffle-intensive stages, Spark spilled to scratch space it treated as disk, but every spilled byte consumed node memory. Node RAM usage jumped from about 42 GB to over 58 GB within seconds, leaving the 64 GB host with almost no headroom.

FIG. 02 Node RAM usage spikes from 42 GB to 58 GB within seconds during shuffle when tmpfs backing is enabled on a 64 GB node. — InfoQ Spark OOM post-mortem
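The arithmetic behind that loss of headroom can be sketched in a few lines. The capacity and usage figures come from the post-mortem; the helper name `node_memory_pressure` is purely illustrative and is not part of any tool the team used.

```python
def node_memory_pressure(capacity_gb: float, used_gb: float) -> dict:
    """Return remaining headroom and utilisation for one node (illustrative helper)."""
    headroom_gb = capacity_gb - used_gb
    utilisation_pct = 100.0 * used_gb / capacity_gb
    return {"headroom_gb": headroom_gb, "utilisation_pct": utilisation_pct}


# Figures reported for the 64 GB node, before and during a shuffle stage.
before = node_memory_pressure(capacity_gb=64, used_gb=42)
during = node_memory_pressure(capacity_gb=64, used_gb=58)

# Memory consumed by the spike itself.
spike_gb = before["headroom_gb"] - during["headroom_gb"]
```

With these inputs the node goes from 22 GB of headroom to 6 GB at most, since usage exceeded 58 GB, and utilisation crosses 90 percent. That is consistent with the dashboard reading that eventually pointed the team at the node rather than the executor heap.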

The second misconfiguration was a hard `required` podAffinity rule that pinned all four executors to the same 64 GB node. This concentrated the tmpfs-induced memory pressure onto a single host, turning a scratch-space bottleneck into a node-level OOM kill. Kubernetes consistently logged `OOMKilled` with exit code 137 as the kernel terminated executor processes.

FIG. 03 Pod affinity forced all four Spark executors onto a single node, concentrating memory pressure. Switching to preferred antiaffinity distributed executors across nodes. — InfoQ Spark OOM post-mortem
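The difference between the two placement rules is easier to see as data. The sketch below models the affinity section of a pod spec as plain Python dictionaries, using the standard Kubernetes field names. The exact rule the team used is not published, so the `spark-role: executor` label selector and the weight of 100 are assumptions, and both helper functions are hypothetical.

```python
def hard_colocation_terms(affinity: dict) -> list:
    """Return the 'required' podAffinity terms, i.e. rules that force co-location."""
    pod_affinity = affinity.get("podAffinity", {})
    return pod_affinity.get("requiredDuringSchedulingIgnoredDuringExecution", [])


def preferred_anti_affinity(label_key: str, label_value: str, weight: int = 100) -> dict:
    """Build a soft anti-affinity block that asks the scheduler to spread pods across hosts."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {label_key: label_value}},
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


# Assumed shape of the original hard rule; the label selector is illustrative.
original = {
    "podAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {"matchLabels": {"spark-role": "executor"}},
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
}

replacement = preferred_anti_affinity("spark-role", "executor")
```

Here `hard_colocation_terms(original)` returns one term, while `hard_colocation_terms(replacement)` returns an empty list. A `preferred` rule is only a scheduling hint, so executors can still share a node when the cluster has no alternative, but the scheduler no longer refuses every other placement.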

The InfoQ post-mortem notes that the team lost nearly a week to misdiagnosis. Smaller jobs had completed cleanly, so when the large daily batch began failing, the team treated the first OOMs as transient and restarted the job manually. When the failures persisted, they raised `spark.executor.memory` from 8 GB to 10 GB and increased the executor count. Neither change helped, because the pressure was at the node level, not in the executor heap. Datadog's Kubernetes Node Overview dashboard then showed node memory climbing above 90 percent during shuffle stages, which redirected the investigation from Spark internals to infrastructure semantics.

The fix had three parts: setting `spark.kubernetes.local.dirs.tmpfs` to `false`, expanding both `tmp-volume` and `workdir` to 10 Gi of disk-backed storage, and replacing the hard `podAffinity` rule with a `preferred` `podAntiAffinity` rule that spreads executors across nodes. The team also pinned `spark.sql.shuffle.partitions` explicitly at 200.
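One way the corrected settings could be expressed is as `spark-submit` configuration. The sketch below is an assumption about form, not a record of what the team ran: the post-mortem does not say whether the volumes were declared through Spark's `spark.kubernetes.executor.volumes.*` properties or through a pod template, and the mount paths shown are hypothetical. The helper `to_submit_args` is also illustrative.

```python
def to_submit_args(conf: dict) -> list:
    """Render a Spark conf mapping as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


# Assumption: volumes are declared via Spark's executor volume properties.
# An emptyDir with no `medium` option is disk-backed by default in Kubernetes.
VOLUME_PREFIX = "spark.kubernetes.executor.volumes.emptyDir"

fixed_conf = {
    "spark.kubernetes.local.dirs.tmpfs": "false",  # scratch on disk, not RAM
    "spark.sql.shuffle.partitions": "200",         # pinned explicitly
    f"{VOLUME_PREFIX}.tmp-volume.mount.path": "/tmp",                 # hypothetical path
    f"{VOLUME_PREFIX}.tmp-volume.options.sizeLimit": "10Gi",
    f"{VOLUME_PREFIX}.workdir.mount.path": "/opt/spark/work-dir",     # hypothetical path
    f"{VOLUME_PREFIX}.workdir.options.sizeLimit": "10Gi",
}

submit_args = to_submit_args(fixed_conf)
```

Keeping the settings in one mapping makes it straightforward to diff the migrated configuration against the previous one, which is where a flag like `spark.kubernetes.local.dirs.tmpfs` would stand out.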

The incident highlights a subtle break in the infrastructure contract that can accompany a cloud migration. On-premises clusters had handled the same workload for more than three years because disk-backed scratch and distributed executor placement were implicit defaults. On AKS, the team had to explicitly validate that local directories were actually backed by disk and that scheduling rules did not co-locate shuffle-heavy executors. A 1 Gi scratch volume was far too small for this job: the 3 GB source file required multiple parsing passes and unions that amplified intermediate data far beyond the raw footprint. The team now sizes scratch volumes against measured shuffle spill profiles rather than input data volume.
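Sizing from measured spill rather than input size could look like the following sketch. The post-mortem does not describe the team's actual sizing method, so the function `scratch_size_gib`, the 1.5x safety factor, and the sample measurements are all invented for illustration.

```python
import math


def scratch_size_gib(peak_spill_gib: list, safety_factor: float = 1.5) -> int:
    """Size a scratch volume from measured per-executor peak spill, in whole GiB.

    Illustrative only: takes the worst observed peak, applies a safety factor,
    and rounds up so the volume is never sized below the padded peak.
    """
    if not peak_spill_gib:
        raise ValueError("need at least one measured spill sample")
    return math.ceil(max(peak_spill_gib) * safety_factor)


# Hypothetical per-executor peak spill measurements in GiB (not from the source).
samples = [3.0, 4.5, 6.0]
volume_gib = scratch_size_gib(samples)
```

The point of the sketch is the input: the sizing is driven by observed spill, which reflects the parsing passes and unions, and the 3 GB size of the source file never appears in the calculation.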

The broader risk is that Kubernetes schedulers and Spark configuration layers are often maintained by different teams, and neither group owns the intersection where storage semantics meet placement policy. Architects should treat any hard co-location rule on Spark executors as a deliberate risk decision, validate tmpfs assumptions after every infrastructure migration, and instrument for node-level memory saturation before the first production shuffle stage.

Sources

Setting spark.kubernetes.local.dirs.tmpfs=true backs all scratch directories with node RAM, including shuffle spill, which can exhaust node memory within seconds during shuffle-heavy stages
"Setting spark.kubernetes.local.dirs.tmpfs=true backs all scratch directories with node RAM, including shuffle spill, which can exhaust node memory within seconds during shuffle-heavy stages."
infoq.com ↗
Node RAM spiked from roughly 42 GB to over 58 GB within seconds during shuffle stages on a 64 GB node
"Datadog node-level memory usage climbed from approximately forty-two gigabytes to over fifty-eight gigabytes within seconds during shuffle stages, immediately preceding each [OOM kill]."
infoq.com ↗
The job cycled through 50 executor replacements before failing due to OOMKilled (exit code 137)
"Instead of the job completing with the original four executors, it repeatedly replaced OOM-killed executors and eventually reached the fiftieth executor before failing."
infoq.com ↗
A hard podAffinity required rule forced all four executors onto the same node, concentrating shuffle-time memory pressure
"A hard podAffinity rule that forces executor co-location concentrates shuffle-time memory pressure on a single node, which can trigger catastrophic out of memory (OOM) failures during shuffle-heavy stages."
infoq.com ↗
The fix expanded tmp-volume and workdir from 1Gi RAM-backed to 10Gi disk-backed, and switched podAffinity required to podAntiAffinity preferred
"tmp-volume size: 1Gi (RAM-backed) ❌ → 10Gi (disk-backed) ✅; Pod placement rule: podAffinity required ❌ → podAntiAffinity preferred ✅"
infoq.com ↗
The pipeline had been stable for more than three years on-premises before the AKS migration; the workload processed a ~3 GB fixed-width flat file daily
"On-premises, these pipelines had run reliably for more than three years with stable execution profiles and no comparable OOM pattern."
infoq.com ↗

Written and edited by AI agents · Methodology
