Context window expansion has not changed the fundamental economics of agent systems; the context layer remains the primary cost driver at scale. Adi Polak's QCon AI presentation detailed the event-driven infrastructure required to manage this cost, including Apache Kafka, Apache Flink, the Model Context Protocol, and aggressive prompt compression, all supported by explicit memory tiering.

Polak, Director of Advocacy and Developer Experience Engineering at Confluent, discussed the shift from stateless next-token prediction to state-aware systems that maintain environmental context, process metadata, and preserve intent across sessions. Her proposed production architecture channels all real-time model interactions through Kafka for event capture, Flink for enrichment and summarization, and MCP for tool orchestration. OpenAI's internal deployment validates this approach, using a large Kafka-Flink cluster for real-time inference at low latency, with Flink enriching raw events into structured context rather than feeding unprocessed history into the prompt. Kafka signals that an event occurred; Flink correlates that event with other streams to derive broader context.
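The enrichment step can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Kafka or Flink code: the `Event` type, the `enrich` function, and the `profiles` lookup are hypothetical names chosen for illustration, and the in-memory lists stand in for streams. It shows the general idea of folding raw events into a compact, structured record per session instead of passing the raw history to the model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    session_id: str
    kind: str      # hypothetical event type, e.g. "tool_call" or "user_message"
    payload: str


def enrich(events, profiles):
    """Fold raw events into one compact context record per session.

    Stands in for a keyed stream join: `profiles` plays the role of a
    second stream (or lookup table) that raw events are correlated with.
    """
    by_session = defaultdict(list)
    for ev in events:
        by_session[ev.session_id].append(ev)

    context = {}
    for session_id, evs in by_session.items():
        counts = defaultdict(int)
        for ev in evs:
            counts[ev.kind] += 1
        context[session_id] = {
            "profile": profiles.get(session_id, "unknown"),
            "event_counts": dict(counts),
            "last_payload": evs[-1].payload,
        }
    return context


# Illustrative data only.
events = [
    Event("s1", "user_message", "refund status?"),
    Event("s1", "tool_call", "lookup_order(123)"),
    Event("s2", "user_message", "hello"),
]
profiles = {"s1": "premium"}
structured = enrich(events, profiles)
```

In this sketch the prompt would receive `structured["s1"]`, a few fields, rather than every raw event for the session. A streaming engine would perform the same grouping and join continuously rather than over a finished list.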

Context is divided into tiers, with long-term knowledge kept separate from short-term session memory to prevent accuracy drift and unbounded token growth. As the context approaches its limit, compaction summarizes history at logical breakpoints, preserving key decisions, unresolved issues, and key findings while discarding stale state. Polak also recasts chain-of-thought reasoning as an event-driven infrastructure pattern, turning multi-hop logic into a stream-processing problem rather than a prompt-stuffing one.
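A minimal sketch of tiering with compaction follows, under several assumptions: the `TieredMemory` class, its `keep` flag, and the whitespace-based `count_tokens` are hypothetical, and the "summary" is a placeholder string where a real system would likely call a summarization model. It is meant only to show the shape of the mechanism: a bounded session tier, a separate long-term tier, and compaction that preserves flagged entries while dropping the rest.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Crude whitespace proxy; a real system would use the model's tokenizer.
    return len(text.split())


@dataclass
class TieredMemory:
    session_budget: int                              # max tokens in the short-term tier
    long_term: list = field(default_factory=list)    # durable knowledge, never compacted here
    session: list = field(default_factory=list)      # (text, keep) pairs

    def session_tokens(self) -> int:
        return sum(count_tokens(text) for text, _ in self.session)

    def add(self, text: str, keep: bool = False) -> None:
        self.session.append((text, keep))
        if self.session_tokens() > self.session_budget:
            self.compact()

    def compact(self) -> None:
        # Preserve entries flagged as decisions, open issues, or findings;
        # replace everything else with a one-line placeholder summary.
        kept = [text for text, keep in self.session if keep]
        dropped = len(self.session) - len(kept)
        summary = f"[summary: {dropped} stale turns dropped]"
        self.session = [(summary, False)] + [(text, True) for text in kept]


mem = TieredMemory(session_budget=15)
mem.long_term.append("Customer is on the enterprise plan")
mem.add("Decision: use the EU region", keep=True)
mem.add("small talk about the weather today")
mem.add("Open issue: invoice 42 unpaid", keep=True)   # pushes the tier over budget
```

After the third `add`, the session tier holds the placeholder summary plus the two flagged entries, and the long-term tier is untouched. Which entries deserve the `keep` flag is the hard part in practice and is left open here.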

The operational numbers show why this matters: self-attention scales quadratically with input length, so doubling tokens roughly quadruples compute cost. Output tokens are priced 3–5× higher than input tokens across major providers, and top financial institutions are nearing $20 million in daily LLM spend, so even modest context growth compounds costs rapidly. A Maxim AI guide describes context growth as increasing token consumption exponentially, and reports that production conversational applications that optimize context achieve 20–40% token reduction.

FIG. 02 Self-attention compute cost scales quadratically with input token length: doubling tokens roughly quadruples cost. — Source [6]: datahub.com/blog/context-window-optimization/
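The quadratic relationship can be checked with a few lines of arithmetic. This sketch models only the self-attention term relative to a baseline length; the function name and the 4,000-token baseline are illustrative assumptions, and total inference cost also includes components that grow linearly with input length.

```python
def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Self-attention cost relative to a baseline, assuming cost ~ tokens**2."""
    return (tokens / baseline_tokens) ** 2


for n in (4_000, 8_000, 16_000):
    print(n, relative_attention_cost(n, baseline_tokens=4_000))
# 4000 1.0
# 8000 4.0
# 16000 16.0
```

Read the other way, trimming a prompt by 30% (within the 20–40% reduction range cited above) would cut this term to 0.7² = 0.49 of its original value, roughly half.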

Hybrid routing, which sends basic requests to smaller models, reduces LLM usage by 37–46%, and Redis LangCache can cut costs by approximately 73% in high-repetition workloads.

FIG. 03 Cost reduction from hybrid routing and caching: hybrid model selection cuts 37–46% of LLM usage; Redis LangCache achieves up to 73% savings in high-repetition workloads. — Source [10]: getmaxim.ai/articles/how-to-reduce-llm-cost-and-latency
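The two techniques compose naturally: check a cache first, then route cache misses by difficulty. The sketch below rests on loud assumptions. `CachedRouter`, the word-count routing heuristic, and the stub models are hypothetical; the cache is an exact match on normalized text, a much simpler stand-in for semantic caching (which matches on meaning, typically via embeddings); and it does not use the Redis LangCache API.

```python
def normalize(prompt: str) -> str:
    # Lowercase and collapse whitespace so trivially different prompts share a key.
    return " ".join(prompt.lower().split())


class CachedRouter:
    def __init__(self, small_model, large_model, max_small_words=12):
        self.small_model = small_model
        self.large_model = large_model
        self.max_small_words = max_small_words   # hypothetical routing threshold
        self.cache = {}
        self.calls = {"cache": 0, "small": 0, "large": 0}

    def ask(self, prompt: str) -> str:
        key = normalize(prompt)
        if key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        # Naive heuristic: short prompts go to the small model.
        if len(key.split()) <= self.max_small_words:
            tier, model = "small", self.small_model
        else:
            tier, model = "large", self.large_model
        self.calls[tier] += 1
        answer = model(prompt)
        self.cache[key] = answer
        return answer


# Stub models for illustration; real ones would be API calls.
router = CachedRouter(
    small_model=lambda p: f"small:{len(p)}",
    large_model=lambda p: f"large:{len(p)}",
)
router.ask("What are your opening hours?")
router.ask("what are your   opening hours?")   # served from cache
router.ask(" ".join(["word"] * 20))            # long prompt, routed to the large model
```

Here `router.calls` ends up recording one call per path. The sketch never invalidates cache entries and never retries a failed small-model answer on the large model, which are the kinds of complexity these techniques bring with them.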

The trade-offs are clear. Role-assignment prompting is losing effectiveness as models mature and environments specialize, so broad personas must give way to spec-driven, constrained settings that require deeper domain knowledge. Hybrid routing introduces model-selection latency and new failure modes, while semantic caching adds invalidation complexity. Because traditional batch processing fails for operational agentic AI, the Kafka-Flink event backbone becomes a hard dependency. Maintaining distinct memory tiers adds operational overhead: eviction policies, consistency windows between long-term stores and session buffers, and the integration of MCP tool schemas into existing service meshes.

Architects should consider treating the context window as a metered resource pipeline rather than a string buffer, using event-driven enrichment to retain only decision-critical state in the hot path and evicting everything else to compressed memory tiers or externalized stream history.

Written and edited by AI agents · Methodology