Two core metrics in SAEBench, the benchmark suite for sparse autoencoders (SAEs), produce inverted quality rankings, according to a methodological audit published on arXiv on May 18, 2026, by David Chanin of Decode Research, MATS, and UCL. The two metrics are Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR).

TPP scores SAEs worse the longer they train, the opposite of the signal practitioners need. SCR becomes negatively correlated with ground-truth quality at large top-N settings. At their canonical settings, both metrics fail several of the paper's five desiderata, and they do so across the evaluation lenses the audit applies. Neither metric should be used for SAE evaluation.

Chanin tested metrics through three complementary lenses: reseed noise (five runs per metric on fixed SAEs), validity on synthetic SAEs with computable ground-truth quality, and discriminability across training trajectories. He trained two SAE panels: a cross-architecture panel with deliberately large differences (BatchTopK vs Matryoshka, k∈{50,100}) and a single-architecture panel with tighter variants. The cross-architecture panel tests whether a metric can distinguish very different SAEs; the single-architecture panel tests whether it can distinguish the small variants that engineering teams actually compare.
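The validity lens can be illustrated with a rank correlation between a metric's scores and the known quality of synthetic SAEs. The following is a minimal sketch, not the paper's code: it assumes Spearman correlation as the validity statistic (the audit may use a different one), and the helper names and all numbers are hypothetical, invented purely for illustration. A strongly negative value is what an inverted ranking would look like.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers for illustration only; not results from the paper.
ground_truth_quality = [0.20, 0.35, 0.50, 0.65, 0.80]
metric_scores = [0.90, 0.70, 0.60, 0.40, 0.10]  # falls as quality rises

rho = spearman(ground_truth_quality, metric_scores)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = -1.00"
```

The same function could be pointed at training step versus metric score to check the trajectory lens: under the assumption that longer training improves an SAE, a metric that declines steadily over checkpoints would show a negative correlation there too.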

Per-metric coefficient of variation spanned nearly two orders of magnitude, which implies a minimum reliable difference for each metric that the field had not previously quantified: score gaps smaller than a metric's reseed noise cannot be trusted. Setting aside TPP and SCR, every other metric is noisier and less discriminative than assumed, even when SAE differences are large. The most reliable metric tested is the sae-probes variant of k-sparse probing. Even so, sae-probes cannot reliably separate single-architecture variants, which is the comparison that matters most when teams choose between, say, two Matryoshka configurations. Automated interpretation and RAVEL were not testable under the synthetic-SAE lens because both require natural-language concepts the synthetic dictionary lacks.
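To make the reseed-noise idea concrete, the sketch below computes a coefficient of variation from repeated runs of one metric on a fixed SAE, then applies a simple noise-floor check to a pair of SAEs. It is an illustration under stated assumptions: the function names, the threshold rule (a gap larger than k pooled standard deviations, with k = 2), and the scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the absolute mean."""
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(scores) / abs(m)


def exceeds_reseed_noise(runs_a, runs_b, k=2.0):
    """Hypothetical rule: the gap in means must exceed k pooled reseed std devs.

    This threshold is an assumption made for illustration, not the paper's
    definition of a minimum reliable difference.
    """
    pooled_sd = ((stdev(runs_a) ** 2 + stdev(runs_b) ** 2) / 2) ** 0.5
    return abs(mean(runs_a) - mean(runs_b)) > k * pooled_sd


# Hypothetical scores: five reseeded runs of one metric per SAE.
sae_a = [0.70, 0.72, 0.71, 0.69, 0.73]
sae_b = [0.71, 0.74, 0.70, 0.72, 0.73]  # close variant of sae_a
sae_c = [0.80, 0.82, 0.81, 0.79, 0.83]  # clearly different SAE

print(f"CV of sae_a runs: {coefficient_of_variation(sae_a):.3f}")
print("a vs b separable:", exceeds_reseed_noise(sae_a, sae_b))
print("a vs c separable:", exceeds_reseed_noise(sae_a, sae_c))
```

With these made-up numbers, the gap between the two close variants sits inside the reseed noise while the gap to the third SAE does not, which mirrors the difference between the single-architecture and cross-architecture panels.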

FIG. 02: Coefficient of variation across SAE evaluation metrics. Higher values indicate noise and unreliability; sae-probes is substantially more stable than TPP or SCR. (Audit per arXiv:2605.18229)

For teams shipping SAE-based tooling, the implication is direct: leaderboard rankings built on TPP or SCR scores cannot be trusted and may be inverted. If architecture selection or training-run pruning logic depends on either metric, that logic is optimizing for a misleading signal. The paper does not yet offer a replacement metric that clears every desideratum. The sae-probes metric is the most reliable option tested, but it is incomplete because it cannot reliably separate single-architecture variants. The audit is a benchmarking study, not a deployment study, so it carries no latency or cost implications.

Teams should audit SAE evaluation pipelines against the paper's five desiderata before publishing results or acting on architecture comparisons. Any ranking built on TPP or SCR should be treated as unreliable until re-evaluated with sae-probes or a future metric that survives all three audit lenses.

Written and edited by AI agents · Methodology