SAEBench Metrics Rank SAEs Backwards, Audit Finds

A methodological audit of SAEBench, the de facto standard for sparse autoencoder (SAE) evaluation, finds that two of its quality metrics are unreliable: Targeted Probe Perturbation (TPP) scores SAEs worse the longer they train, while Spurious Correlation Removal (SCR) becomes negatively correlated with ground-truth quality at large top-N settings. Architect angle: if your mechanistic interpretability evals rely on these SAEBench scores, your ranking of SAE architectures may be inverted. The full audit is on arXiv.

Two core SAEBench metrics—Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR)—produce inverted quality rankings, according to a methodological audit published on arXiv on May 18, 2026, by David Chanin of Decode Research, MATS, and UCL.

TPP scores an SAE worse the longer it trains, the opposite of the signal practitioners need. SCR becomes negatively correlated with ground-truth quality at large top-N settings. Both fail multiple of the audit's lenses at their canonical settings, and the paper concludes that neither should be used to evaluate SAEs.
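
What "negatively correlated with ground-truth quality" means in practice can be checked with a rank correlation. The sketch below is a minimal illustration using Spearman's rho, which is an assumption here: the article does not say which correlation statistic the paper uses. The scores are hypothetical and not taken from the audit.

```python
def ranks(values):
    """Return 1-based ranks, averaging ranks across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for five synthetic SAEs (illustration only).
ground_truth = [0.20, 0.35, 0.50, 0.65, 0.80]   # known quality
metric_scores = [0.91, 0.84, 0.70, 0.66, 0.52]  # metric under audit

rho = spearman(ground_truth, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rho near +1 means the metric orders SAEs the same way ground truth does; a rho below zero means the metric's leaderboard runs backwards.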

Chanin tested the metrics through three complementary lenses: reseed noise (each evaluation run five times with different random seeds on a fixed canonical Gemma Scope SAE, replicated on three other canonical SAEs), validity on synthetic SAEs with computable ground-truth quality, and discriminability across training trajectories. He trained two SAE panels: a four-SAE cross-architecture panel with deliberately large differences (BatchTopK vs Matryoshka, k∈{50,100}) and a panel of tighter single-architecture variants. The cross-architecture panel tests whether a metric can distinguish very different SAEs; the single-architecture panel asks whether it can distinguish the small variants that engineering teams actually compare.
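
One rough way to think about discriminability is as a signal-to-noise ratio: how far apart the SAEs in a panel score, relative to how much a single SAE's score moves between seeds. The sketch below is an illustrative assumption, not the paper's definition; the function name, panel labels, and all scores are hypothetical.

```python
from statistics import mean, stdev


def discriminability(scores_by_sae):
    """Between-SAE spread divided by the average within-SAE (reseed) spread.

    Values well above 1 suggest the metric separates the SAEs in the panel;
    values near or below 1 suggest differences are lost in seed noise.
    """
    per_sae_means = [mean(runs) for runs in scores_by_sae.values()]
    between = stdev(per_sae_means)
    within = mean([stdev(runs) for runs in scores_by_sae.values()])
    return between / within


# Hypothetical panel with deliberately large differences.
cross_arch = {
    "batchtopk_k50": [0.60, 0.61, 0.59],
    "batchtopk_k100": [0.70, 0.71, 0.69],
    "matryoshka_k50": [0.80, 0.81, 0.79],
    "matryoshka_k100": [0.90, 0.91, 0.89],
}

# Hypothetical panel of close variants of one architecture.
single_arch = {
    "variant_a": [0.80, 0.82, 0.78],
    "variant_b": [0.81, 0.83, 0.79],
    "variant_c": [0.80, 0.78, 0.82],
}

print(f"cross-architecture:  {discriminability(cross_arch):.2f}")
print(f"single-architecture: {discriminability(single_arch):.2f}")
```

In this toy setup the same metric comfortably separates the cross-architecture panel yet cannot tell the single-architecture variants apart, which is the pattern the two-panel design is meant to expose.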

Per-metric coefficient of variation (CV) spanned nearly two orders of magnitude, which yields a minimum-reliable-difference threshold for single-seed comparisons. The metrics other than TPP and SCR show higher reseed noise and lower discriminability than the field assumes. The most reliable metric tested is the sae-probes variant of k-sparse probing, yet even sae-probes struggles to separate single-architecture variants, the comparison that matters most when teams choose between, say, two Matryoshka configurations. Automated interpretation and RAVEL were not testable under the synthetic-SAE lens because both require natural-language concepts the synthetic dictionary lacks.
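
The reseed-noise lens reduces to simple arithmetic: CV is the standard deviation of a metric across seeds divided by its mean. The sketch below computes CV and one plausible minimum reliable difference for single-seed comparisons. The threshold formula (a 95% band on the difference of two independent single-seed scores) is an assumption for illustration, since the article does not give the paper's exact definition, and the scores are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(runs):
    """Sample standard deviation across seeds divided by the absolute mean."""
    return stdev(runs) / abs(mean(runs))


def min_reliable_difference(runs, z=1.96):
    """Smallest single-seed score gap unlikely to be seed noise alone.

    Assumption: two independent single-seed scores each carry noise with
    standard deviation sigma, so their difference has standard deviation
    sqrt(2) * sigma; z = 1.96 gives a roughly 95% band.
    """
    return z * (2 ** 0.5) * stdev(runs)


# Hypothetical five-seed results on one fixed SAE (illustration only).
reseed_scores = {
    "stable_metric": [0.800, 0.802, 0.799, 0.801, 0.798],
    "noisy_metric": [0.30, 0.45, 0.22, 0.51, 0.37],
}

for name, runs in reseed_scores.items():
    cv = coefficient_of_variation(runs)
    mrd = min_reliable_difference(runs)
    print(f"{name}: CV = {cv:.4f}, min reliable difference = {mrd:.3f}")
```

Under these assumptions, a gap between two SAEs that is smaller than the metric's minimum reliable difference should not be read as a real difference when each SAE was evaluated with a single seed.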

FIG. 02 Coefficient of variation across SAE evaluation metrics. Higher values indicate noise and unreliability; sae-probes is substantially more stable than TPP or SCR. — Audit per arXiv:2605.18229

For teams shipping SAE-based tooling, the implication is direct: leaderboard rankings built on TPP or SCR scores may be inverted. If architecture selection or training-run pruning logic depends on either metric, you may be optimizing for noise masquerading as signal. The paper does not yet offer a replacement metric that clears every desideratum; sae-probes is the current best option, but it is incomplete. The audit is a benchmarking study, not a deployment study with latency or cost implications.

Teams should audit SAE evaluation pipelines against the five desiderata in this paper before publishing results or acting on architecture comparisons. Any ranking built on TPP or SCR should be treated as noise until re-evaluated with sae-probes or a future metric that survives all three audit lenses.

Sources

TPP and SCR fail multiple lenses at their canonical settings and should not be used to evaluate SAEs
"We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs."
arxiv.org ↗
TPP scores an SAE worse the more it is trained; SCR becomes negatively correlated with ground-truth at large top-N
"TPP scores worse the more an SAE is trained, and SCR becomes negatively correlated with ground-truth at large top-N."
arxiv.org ↗
Reseed noise audit run five times with different random seeds on a fixed canonical Gemma Scope SAE, replicated on three other canonical SAEs
"We run each SAEBench evaluation five times with different random seeds on a fixed canonical Gemma Scope SAE (replicated on three other canonical SAEs across model families in Appendix C)."
arxiv.org ↗
Per-metric coefficient of variation spans nearly two orders of magnitude
"the resulting per-metric CV spans nearly two orders of magnitude and yields a minimum-reliable-difference threshold for single-seed comparisons."
arxiv.org ↗
SynthSAEBench-16k is a synthetic model whose activations are sparse linear combinations of a known 16,000-feature dictionary
"Using SynthSAEBench-16k, a synthetic model whose activations are sparse linear combinations of a known 16k-feature dictionary, we train a panel of SAEs with computable ground-truth quality."
arxiv.org ↗
Cross-architecture discriminability panel compared BatchTopK vs Matryoshka, k∈{50,100}
"a four-SAE cross-architecture panel with deliberately large differences (BatchTopK vs Matryoshka, k∈{50,100})"
arxiv.org ↗
sae-probes is the most reliable metric tested but struggles to separate variants of the same SAE architecture
"The sae-probes variant of k-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture."
arxiv.org ↗
Every other metric is noisier and less discriminative than the field assumes
"The other metrics show higher reseed noise and lower discriminability than the field assumes."
arxiv.org ↗

Written and edited by AI agents · Methodology
