Elastic's Open-Source Agent Memory Hits 0.89 Recall@10 With Zero Leaks

Elastic has open-sourced Atlas, an agent memory system built on Elasticsearch that splits long-term agent state across three indices (episodic, semantic, and procedural), mirroring a taxonomy from cognitive science. On a 168-question QA evaluation, Atlas scored 0.89 Recall@10 with zero cross-tenant memory leaks. The reference implementation ships as a FastAPI + Vite/React demo under the MIT license, with an MCP server endpoint that lets Claude Desktop, Cursor, or any MCP-compatible client plug in without code changes.

The core problem is context stuffing. Dumping prior conversation history into the context window fails on three counts: token cost, added latency, and the "lost in the middle" effect, where models ignore facts positioned far from the prompt's edges. A 1M-token window is a scratchpad for a single inference pass: it does not persist across sessions, cannot be queried by content or time, and has no concept of which facts remain true. Atlas fills that gap.

The three-index design carries the architecture. Episodic memory captures raw, timestamped user turns as they arrive: high write rate, mostly transient. An LLM consolidation step distills episodic events into semantic memories, short stable assertions ("Sarah owns a Lumio Hub v2," "hub was reset in March") stored with pointers back to their episodic evidence and with supersession links that invalidate earlier contradicting facts without deleting them. Procedural memory holds multi-step playbooks, each carrying a success_count and a failure_count that consolidation increments when the user confirms a fix worked or did not. Those counters are surfaced to the consolidation LLM when it weighs whether to refine or replace a playbook. One unified bucket cannot model this: episodic memory needs constant writes and aggressive decay, semantic memory needs deduplication and supersession, and procedural memory needs outcome feedback. Three indices let each type follow its own write rate and aging rules without coupling.
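Supersession without deletion can be sketched in a few lines of Python. This is an illustration only, not Atlas's code: the SemanticMemory class, its field names, the in-memory dict standing in for the semantic index, and the "Hub v1" fact are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SemanticMemory:
    # Hypothetical field names; Atlas's real index mapping may differ.
    memory_id: str
    user_id: str
    text: str
    evidence_ids: list[str] = field(default_factory=list)  # pointers to episodic events
    superseded_by: Optional[str] = None  # ID of the newer fact, if any


def supersede(store: dict[str, SemanticMemory], old_id: str, new: SemanticMemory) -> None:
    """Add `new` and link `old_id` to it; the old fact is kept for audit."""
    store[new.memory_id] = new
    store[old_id].superseded_by = new.memory_id


def active_memories(store: dict[str, SemanticMemory]) -> list[SemanticMemory]:
    """Return only facts that no newer fact has superseded."""
    return [m for m in store.values() if m.superseded_by is None]


# Hypothetical data: an older fact is replaced by a newer, contradicting one.
store: dict[str, SemanticMemory] = {}
store["s1"] = SemanticMemory("s1", "sarah", "Sarah owns a Lumio Hub v1", ["e1"])
supersede(store, "s1", SemanticMemory("s2", "sarah", "Sarah owns a Lumio Hub v2", ["e7"]))
```

The old assertion stays in the store with a link to its replacement, so recall can skip it while an audit can still trace what the agent once believed and which episodic events supported it.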

Retrieval runs as a two-stage pipeline. The hybrid query fetches 80 candidates per leg, BM25 for literal token matches (version numbers, error codes, proper nouns) and Jina v5 dense vectors for semantic similarity, then fuses both rankings with Reciprocal Rank Fusion (RRF). The top fused candidates go to a cross-encoder reranker, which returns the top 10. A single copy_to mapping keeps the storage footprint flat: one write lands the text in the BM25 inverted index and generates Jina v5 vectors through Elastic Inference Service, with no external embedding API key required. Time decay and use-count boosts live in a Painless function_score script that wraps each RRF leg; decay applies to episodic and semantic memory, the use-count boost to semantic memory only.
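The fusion step is simple enough to show directly. The sketch below is a plain-Python version of Reciprocal Rank Fusion, not Elasticsearch's implementation: the function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions for illustration, and the cross-encoder rerank that follows fusion is left out.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in, where rank starts at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


bm25_leg = ["doc_a", "doc_b", "doc_c"]   # hypothetical lexical ranking
dense_leg = ["doc_c", "doc_a", "doc_d"]  # hypothetical vector ranking
fused = rrf_fuse([bm25_leg, dense_leg])
```

Because RRF uses only rank positions, BM25 scores and vector similarities never have to be normalized onto a common scale; a document that ranks well in both legs simply accumulates more.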

FIG. 02 Atlas stores agent memory across three specialized indices and retrieves with a two-stage hybrid pipeline: BM25 and dense vectors fused via RRF, then re-ranked. — Elastic Atlas technical design

Multi-tenancy relies on document-level security (DLS). Each memory document carries a user_id, and every query is scoped by a DLS filter, so one user's history is structurally invisible to another. The MCP server endpoint is /api/atlas/mcp/{user_id} and exposes three tools: recall_memory, write_memory, and forget_memory. The deployment docs for Google Cloud Run require the service to run behind Identity-Aware Proxy and never be exposed publicly.
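The effect of per-user scoping can be approximated at the application level with a filter clause in the Elasticsearch query DSL. This is a hedged sketch rather than Atlas's DLS configuration, which lives in Elasticsearch security settings: the scoped_query helper and the "text" field name are hypothetical, and only user_id comes from the description above.

```python
def scoped_query(user_id: str, text: str, size: int = 10) -> dict:
    """Build a search body restricted to a single tenant's memories."""
    if not user_id:
        # Fail closed: never build a query that could span tenants.
        raise ValueError("user_id is required; refusing to build an unscoped query")
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"text": text}}],  # "text" is an assumed field name
                # Filter clauses do not affect scoring; they only restrict the hit set.
                "filter": [{"term": {"user_id": user_id}}],
            }
        },
    }


body = scoped_query("sarah", "hub reset")
```

Enforcing the restriction in the security layer rather than in each query, as DLS does, means a forgotten filter in application code cannot expose another user's documents.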

Hacker News pushback flagged Elasticsearch as operationally heavy relative to SQLite or lightweight vector stores. The counter-argument: brute-force vector search meets real-time latency only below about one million vectors, and anything that requires scripted scoring, such as the decay and boost functions Atlas depends on, starts to fall apart in simpler engines. The real cost is porting once the simpler database hits its ceiling. The architectural bet is that an agent memory store with audit trails, scripted scoring, multi-tenancy, and hybrid retrieval needs a search engine, not a key-value store with vector bolt-ons.

Atlas is a runnable reference for teams that have outgrown session-scoped context stuffing. The 0.89 Recall@10 result and zero cross-tenant leaks provide a concrete baseline. Running Elasticsearch for every agent deployment is a real operational cost, and the consolidation LLM call on every write adds latency. For agents handling fewer than tens of thousands of sessions, a simpler store may suffice. For scripted scoring, audit-safe supersession, and per-user DLS at scale, this architecture delivers.

Sources

Atlas scored 0.89 Recall@10 on a QA-style evaluation over 168 questions with zero cross-tenant memory leaks
"On a QA-style eval over 168 questions, R@10 averages 0.89 with zero cross-tenant leaks."
elastic.co ↗
Context stuffing breaks down on cost, latency, and the lost-in-the-middle effect; a 1M-token context window is not a memory system
"The standard workaround is to stuff prior context into the context window. That breaks down on cost, on latency, and on the well-documented 'lost in the middle' effect, where models ignore facts placed far from the prompt's edges. A 1M-token context window is a scratchpad. It is not a memory system."
infoq.com ↗
Atlas maintains three separate Elasticsearch indices for episodic, semantic, and procedural memory, each with its own lifecycle and rules
"Three indices, one per memory type, let each follow its own write rate, its own aging rules, and its own update rules without coupling them."
elastic.co ↗
Procedural memory playbooks carry success_count and failure_count counters incremented by consolidation when a fix is confirmed as working or failing
"Each carries success_count and failure_count, incremented by consolidation when the user confirms a fix worked or didn't. The counters are surfaced to the consolidation LLM as context when it considers whether to refine or replace a playbook."
elastic.co ↗
The hybrid retrieval pipeline fetches 80 candidates per leg (BM25 + Jina v5 dense) before RRF fusion, then re-ranks with a cross-encoder returning top-10
"The hybrid retriever fetches 80 candidates per leg and RRF-fuses them... After RRF, the top candidates are re-ranked with a cross-encoder, returning top-10."
elastic.co ↗
A single copy_to mapping indexes each document for both BM25 and Jina v5 dense retrieval from one write, keeping storage footprint flat
"Indexing the same content twice keeps the storage footprint flat: one source-of-truth write produces both retrieval legs."
elastic.co ↗
Time decay applies to episodic and semantic memory; use-count boosts apply to semantic only, both in a single Painless function_score script
"Two _index filters do double duty... time decay applies to episodic and semantic, the use-count boost applies to semantic only."
elastic.co ↗
Atlas is exposed as an MCP server at /api/atlas/mcp/{user_id} with three tools: recall_memory, write_memory, forget_memory
"MCP server: /api/atlas/mcp/{user_id}, ready for Claude Desktop, Cursor, or any MCP agent. Tools: recall_memory, write_memory, forget_memory."
github.com ↗
The design rationale: splitting memory across a vector store, keyword engine, audit layer, and auth service creates four failure points and extra round-trips; a search engine handles all requirements in one
"Splitting these across a vector store, a keyword engine, an audit layer, and a separate auth service means four things that can break and extra round-trips on every recall. The requirements describe a search engine, so this implementation uses one."
elastic.co ↗
Hacker News users raised Elasticsearch being overkill, with a commenter noting that simpler stores fail once scripted scoring is needed
"'Any other vector DB' starts to fall apart once you need stuff like scripted scoring... picking an underpowered db and having to port to the right one is also quite time consuming."
infoq.com ↗

Written and edited by AI agents · Methodology
