A production post-mortem published in May 2026 on InfoQ documents how a search-and-ads retrieval pipeline migrated from scheduled batch jobs to Spark Structured Streaming in micro-batch mode. Author Parveen Saini's account shows where the architecture cracked under production load: scheduling delay and orchestration overhead, not computation, turned out to be the primary bottleneck. The standard toolkit for solving that problem carried its own failure modes.
The pipeline maintained a Solr-backed inverted index covering several million documents. Full rebuilds took two to three hours, with validation and deployment pushing total turnaround to roughly five hours, making frequent full rebuilds impractical. The delta pipeline ingested new ads, campaign updates, and behavioral signals like co-purchase data. It ran on time-partitioned files in S3-style object storage, receiving new incremental data every five to seven minutes. Each delta run covered the last five hours of partitions, with multiple runs per hour expected. A staleness gap of one scheduling cycle translated directly into delayed ad activation and stale retrieval results.
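The article does not publish the pipeline's path scheme, so the following is a minimal sketch of what one delta run's input window could look like, assuming a hypothetical hive-style layout (`s3://ads-index/deltas/dt=YYYY-MM-DD/hour=HH/`) and hourly partitions; the prefix, layout, and window constant are illustrative assumptions, not the team's actual configuration.

```python
# Sketch of the trailing five-hour input window for one delta run.
# BASE_PREFIX and the dt=/hour= layout are hypothetical.
from datetime import datetime, timedelta, timezone

BASE_PREFIX = "s3://ads-index/deltas"  # hypothetical bucket/prefix
WINDOW_HOURS = 5                       # each run re-reads the last five hours of partitions


def partition_prefixes(now: datetime, window_hours: int = WINDOW_HOURS) -> list[str]:
    """Return the hourly partition prefixes covered by a single delta run."""
    cursor = (now - timedelta(hours=window_hours)).replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while cursor <= now:
        prefixes.append(f"{BASE_PREFIX}/dt={cursor:%Y-%m-%d}/hour={cursor:%H}/")
        cursor += timedelta(hours=1)
    return prefixes


if __name__ == "__main__":
    for prefix in partition_prefixes(datetime.now(timezone.utc)):
        print(prefix)
```

With hourly partitions and runs expected several times per hour, most partitions in the window are re-read on every cycle, which is why a one-cycle scheduling delay maps directly onto retrieval staleness.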
The team converted the batch jobs to continuously running micro-batches using Spark Structured Streaming. They did not rely on Spark's native checkpointing or event-time watermark semantics. The pipeline advanced on partition-level progress rather than ordered event streams. The team maintained an external logical watermark tracking the latest processed partition by timestamp. Progress was determined by listing and interpreting partitioned data in object storage, not by consuming an ordered log.
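The post does not include code, but the external-watermark pattern it describes can be sketched as a small cursor object stored in the same object store as the data, read before and committed after every micro-batch. The bucket, key names, and the placeholder indexing step below are assumptions for illustration only.

```python
# Sketch of an external logical watermark for a micro-batch loop that does not
# use Spark's native checkpointing. All names (bucket, keys) are hypothetical.
import json

import boto3
from pyspark.sql import SparkSession

BUCKET = "ads-index"                    # hypothetical
WATERMARK_KEY = "state/watermark.json"  # hypothetical external cursor

s3 = boto3.client("s3")
spark = SparkSession.builder.appName("delta-indexer").getOrCreate()


def read_watermark() -> str:
    """Return the timestamp of the latest fully processed partition."""
    obj = s3.get_object(Bucket=BUCKET, Key=WATERMARK_KEY)
    return json.loads(obj["Body"].read())["last_partition"]


def commit_watermark(partition_ts: str) -> None:
    """Advance the cursor only after the downstream index write has succeeded."""
    body = json.dumps({"last_partition": partition_ts}).encode()
    s3.put_object(Bucket=BUCKET, Key=WATERMARK_KEY, Body=body)


def process_partition(path: str) -> None:
    """One micro-batch: read a time partition and push the delta to the index."""
    df = spark.read.parquet(path)
    # The real pipeline would transform and write to Solr here; printing keeps
    # the sketch inert.
    print(f"would index {df.count()} rows from {path}")
```

The point of keeping the cursor outside Spark is that progress is defined by partition timestamps the team controls, not by offsets or checkpoint state owned by the framework.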
Two failure categories dominated the post-mortem. First, S3 eventual consistency made completion markers and success-file patterns unreliable as signals that a partition was ready to process. The team adopted deterministic, rate-based progress — advancing by time rather than waiting for an explicit "done" signal. This approach held under production variance. Second, lag and restart semantics required explicit design rather than inheritance from the framework. In a freshness-driven pipeline with overlapping window semantics, replaying the full backlog after a restart degraded freshness further. The fix: skip directly to the latest available partition on restart, treating missed intermediate states as acceptable loss for immediate freshness recovery.
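Both remedies reduce to simple cursor arithmetic. The sketch below illustrates them under assumed hourly partitions and an assumed ten-minute safety lag; the function names, step size, and lag value are not from the article.

```python
# Sketch of the two fixes: rate-based progress (advance on the clock, not on
# _SUCCESS markers) and skip-to-latest restart semantics. Values are illustrative.
from datetime import datetime, timedelta, timezone

PARTITION_STEP = timedelta(hours=1)   # hypothetical partition granularity
SAFETY_LAG = timedelta(minutes=10)    # let late writers land before reading


def advance_by_time(last_processed: datetime, now: datetime) -> datetime | None:
    """Deterministic, rate-based progress: the next partition becomes eligible
    once the clock has moved one step (plus a safety lag) past it, with no
    dependence on completion markers."""
    candidate = last_processed + PARTITION_STEP
    if candidate + SAFETY_LAG <= now:
        return candidate
    return None  # nothing eligible yet; poll again on the next cycle


def resume_after_restart(now: datetime) -> datetime:
    """On restart, jump the cursor to the newest eligible partition rather than
    replaying the backlog; intermediate partitions are accepted as lost in
    exchange for immediate freshness recovery."""
    return (now - SAFETY_LAG).replace(minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("cursor after restart:", resume_after_restart(now))
    print("next in steady state:", advance_by_time(now - timedelta(hours=3), now))
```

The design choice worth noting is that both paths are pure functions of timestamps, so the same logic behaves identically in steady state, after a crash, and during a backfill test.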
For enterprise data architects, the structural implication is direct. Teams migrating batch pipelines to streaming often default to Kafka or a managed streaming service. This post-mortem argues that for object-store-based pipelines, which cover a large share of enterprise data infrastructure, that migration introduces per-record operational complexity without delivering meaningful latency improvement. The scheduling delay, not the processing model, drives freshness lag. Micro-batch over existing object storage with explicit external watermark management closes that gap while keeping the operational surface area close to the batch infrastructure teams already know.
The second implication targets ML inference pipelines. Teams piping feature data or retrieval indexes to LLM inference endpoints face the same freshness-vs-complexity tradeoff. The report's finding that long-running streaming jobs should treat restarts as normal operations — not failure conditions — applies directly to any continuously running feature engineering or embedding refresh job feeding a model serving layer.
Open questions remain: whether the external watermark approach scales to multi-tenant environments where partition ownership is shared, and how the architecture interacts with Delta Lake's transaction log when Delta Lake, rather than raw S3, is the storage layer. The case study is scoped to a fixed-function pipeline, and Saini explicitly notes the design choices are constraint-specific.
The takeaway for platform teams: if streaming migration is stalled on Kafka adoption, it may be because Kafka is not required. Replacing a scheduler with a continuously running micro-batch job and an external cursor is often the right answer — provided you design the failure path before the first production incident forces you to.