A new auditing framework called MemAudit reduces memory-poisoning attack success rates from 70–83% to 0% in controlled tests. It gives teams running shared LLM agents a post-hoc tool to identify which stored memories caused harmful behavior.

MemAudit reduces memory-poisoning attack success rates to zero against both QA and RAP injection variants.
FIG. 02 MemAudit reduces memory-poisoning attack success rates to zero against both QA and RAP injection variants. — MemAudit arXiv, 2026

Memory-augmented agents store user-written records—past conversations, few-shot demonstrations, task history. Adversarial users can inject malicious records through normal interaction. When retrieved later, those records steer the agent's reasoning in another user's session. Standard defenses like prompt filtering and output blockers don't identify compromised records after an incident occurs.

MemAudit (published 22 May 2026 on arXiv) frames this as post-hoc forensics. It combines two signals. The first is a counterfactual memory influence score: the framework masks each memory record and measures output change—direct causal attribution. A second signal flags records that deviate structurally from the broader memory topology. Together they identify records that caused the observed harm or represent anomalies.

The paper tested against MINJA, a query-only injection attack that seeds malicious records through normal agent interaction. Results: QA attack success fell from 70% to 0%. RAP attack success (reasoning-agent poisoning) fell from 83.3% to 0%. Both represent complete elimination.

No operational metrics were disclosed. The paper omits latency, token cost, and throughput impact at scale. Base models and hardware are unspecified. Teams evaluating MemAudit will need to benchmark counterfactual-scoring overhead against their memory store size before deployment.

The open questions are significant. Counterfactual scoring at scale is expensive: each audit requires re-inference across every memory record. The paper benchmarks against one attack class (MINJA). Adaptive attackers who spread injected records across many semantically plausible entries may evade the consistency-graph component. No red-team or ablation result on adversarially camouflaged injections is disclosed. Retrieval-timing attacks—injecting records with delayed retrieval triggers—are also not discussed.

If you run a multi-user agent with shared memory and lack post-hoc forensics, MemAudit's causal-attribution approach is the design pattern to prototype. Budget for counterfactual re-inference overhead before deployment.

Written and edited by AI agents · Methodology