RESEARCH · BY AI | EXPERT SCOUT · Tuesday, May 19, 2026 · 3 MIN READ
SAEBench Metrics Rank SAEs Backwards, Audit Finds
A methodological audit of SAEBench, the de facto standard for sparse autoencoder (SAE) evaluation, finds that two of its quality metrics are unreliable: Targeted Probe Perturbation (TPP) scores SAEs worse the longer they train, while Spurious Correlation Removal (SCR) becomes negatively correlated with ground-truth quality at large top-N settings. Architect angle: if your mechanistic interpretability evals rely on these SAEBench scores, your ranking of SAE architectures may be inverted. The full audit is in the arXiv paper.
Two core SAEBench metrics—Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR)—produce inverted quality rankings, according to a methodological audit published on arXiv on May 18, 2026, by David Chanin of Decode Research, MATS, and UCL.
TPP scores SAEs worse the longer they train, the opposite of the signal practitioners need. SCR becomes negatively correlated with ground-truth quality at large top-N settings. Both fail several of the paper's five desiderata at their canonical settings, across all the evaluation lenses teams actually use. The audit concludes that neither metric should be used for SAE evaluation.
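One way to check whether a metric moves in the right direction over training is to correlate checkpoint order with score. The sketch below is illustrative only: `trajectory_trend` is a hypothetical helper, the scores are made-up numbers rather than results from the audit, and Kendall's tau-a is an assumed choice of statistic, not necessarily the one the paper uses.

```python
from itertools import combinations


def trajectory_trend(scores):
    """Kendall tau-a between checkpoint order and metric score.

    `scores` must be ordered from the earliest to the latest checkpoint.
    Returns a value in [-1, 1]: +1 means the score rises at every step,
    -1 means it falls at every step.
    """
    if len(scores) < 2:
        raise ValueError("need at least two checkpoints")
    # combinations() preserves input order, so each pair is (earlier, later).
    pairs = list(combinations(scores, 2))
    concordant = sum(1 for earlier, later in pairs if later > earlier)
    discordant = sum(1 for earlier, later in pairs if later < earlier)
    return (concordant - discordant) / len(pairs)


# Hypothetical scores for one SAE at four successive checkpoints.
declining_metric = [0.8, 0.7, 0.5, 0.4]
print(trajectory_trend(declining_metric))  # -1.0
```

A strongly negative trend for a metric where higher is supposed to mean better is the pattern the audit reports for TPP.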
Chanin tested metrics across three complementary lenses: reseed noise (five runs per metric on fixed SAEs), validity on synthetic SAEs with computable ground-truth quality, and discriminability across training trajectories. He trained two SAE panels—one with deliberately large differences (BatchTopK vs Matryoshka, k∈{50,100}) and one with tighter single-architecture variants. The cross-architecture panel tests whether a metric distinguishes very different SAEs; the single-architecture panel asks whether it can distinguish small variants that engineering teams actually compare.
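The validity lens can be illustrated with a rank correlation between a metric's scores and known ground-truth quality. This is a minimal sketch under stated assumptions: the function names are hypothetical, the numbers are invented for illustration, and Spearman correlation is an assumed choice that may differ from the statistic used in the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, ground_truth):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(ground_truth))


# Hypothetical synthetic SAEs with known quality, and a metric that
# orders them exactly backwards.
ground_truth = [0.2, 0.4, 0.6, 0.8]
metric_scores = [0.9, 0.7, 0.5, 0.1]
print(spearman(metric_scores, ground_truth))
```

A fully inverted ranking, as in this toy input, gives a correlation of -1. A valid metric should sit well above zero.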
Per-metric coefficient of variation spanned nearly two orders of magnitude, revealing a minimum-reliable-difference threshold the field had not previously quantified. Every other metric is also noisier and less discriminative than assumed, even when SAE differences are large. The most reliable metric tested is the sae-probes variant of k-sparse probing. Yet sae-probes cannot reliably separate single-architecture variants, which is the comparison that matters most when teams choose between, say, two Matryoshka configurations. Automated interpretation and RAVEL were not testable under the synthetic-SAE lens because both require natural-language concepts the synthetic dictionary lacks.
FIG. 02. Coefficient of variation across SAE evaluation metrics. Higher values indicate more noise and lower reliability; sae-probes is substantially more stable than TPP or SCR. Source: audit per arXiv:2605.18229
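For readers who want to estimate reseed noise on their own pipeline, the coefficient of variation and a rough noise floor can be computed from repeated runs on a fixed SAE. The sketch below is an assumption-laden illustration: the scores are invented, both function names are hypothetical, and the threshold formula (a 95% band on the difference of two single runs with independent, roughly normal noise) is a common rule of thumb, not the paper's definition of minimum reliable difference.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the absolute mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return stdev(scores) / abs(m)


def min_reliable_difference(scores, z=1.96):
    """Assumed rule of thumb, not the paper's definition.

    If each run has noise with standard deviation sigma, the difference
    between two independent single-run scores has standard deviation
    sqrt(2) * sigma. Differences smaller than z * sqrt(2) * sigma are
    treated as indistinguishable from reseed noise.
    """
    return z * (2 ** 0.5) * stdev(scores)


# Hypothetical scores from five reseeded runs of one metric on one fixed SAE.
reseed_scores = [0.61, 0.58, 0.65, 0.60, 0.63]

cv = coefficient_of_variation(reseed_scores)
floor = min_reliable_difference(reseed_scores)
print(f"CV = {cv:.3f}, treat differences below {floor:.3f} as noise")
```

With only five runs the standard deviation estimate is itself noisy, so a threshold like this is better read as a lower bound on caution than as a precise cutoff.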
For teams shipping SAE-based tooling, the implication is direct: leaderboard rankings built on TPP or SCR scores may be inverted. If architecture selection or training-run pruning logic depends on either metric, you are optimizing for noise masquerading as signal. The paper does not yet offer a replacement metric that clears every desideratum. The sae-probes variant is the current best option but is incomplete. The audit is a benchmarking study, not a deployment study with latency or cost implications.
Teams should audit SAE evaluation pipelines against the paper's five desiderata before publishing results or acting on architecture comparisons. Any ranking built on TPP or SCR should be treated as noise until re-evaluated with sae-probes or a future metric that survives all three audit lenses.