A Stanford-affiliated research team has solved a key interpretability problem in medical foundation models using a geometry-guided sparse autoencoder framework called GeoSAE. The framework decodes what clinical information brain MRI models actually encode, with enough cross-cohort stability to support regulated deployment.
Standard sparse autoencoders break down in deep transformer layers, producing redundant or dead features that obscure the model's internal representations. For brain MRI foundation models trained on thousands of scans, this has made model auditing effectively impossible. Aging confounds nearly every clinical variable, so a naive autoencoder would track patient age rather than disease signals.
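For orientation, the sketch below shows what a standard sparse autoencoder over frozen encoder embeddings looks like: a learned dictionary trained with a reconstruction objective plus a sparsity penalty. This is a generic illustration of the baseline technique, not the paper's implementation; the dimensions, penalty weight, and names such as SparseAutoencoder are illustrative assumptions.

```python
# Minimal sketch of a standard sparse autoencoder (SAE) trained on frozen
# foundation-model embeddings. Generic illustration only; hyperparameters
# and names are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, embedding_dim: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(embedding_dim, n_features)
        self.decoder = nn.Linear(n_features, embedding_dim)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative, so the L1 penalty
        # below can drive most of them to zero.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_weight: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each input
    # to activate only a handful of dictionary features.
    mse = torch.mean((x - reconstruction) ** 2)
    sparsity = torch.mean(torch.abs(features))
    return mse + l1_weight * sparsity


# Usage on a batch of frozen MRI-encoder embeddings (shape: [batch, dim]).
embeddings = torch.randn(64, 768)  # placeholder for real embeddings
sae = SparseAutoencoder(embedding_dim=768, n_features=4096)
recon, feats = sae(embeddings)
loss = sae_loss(embeddings, recon, feats)
loss.backward()
```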
GeoSAE addresses both problems. It uses the foundation model's learned manifold structure as a geometric prior that constrains which features survive in deep layers and prevents feature collapse. Feature annotation runs through age-deconfounded partial correlations, removing the confound before any clinical labels are assigned. The result is a sparse, interpretable feature set derived from a frozen model, with no retraining required.
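One standard way to realize an age-deconfounded partial correlation is to regress age out of both the feature activation and the clinical variable, then correlate the residuals. The sketch below shows that generic recipe; the helper names (residualize, age_partial_corr), the linear age model, and the synthetic data are assumptions, not details taken from the paper.

```python
# Sketch of an age-deconfounded partial correlation between an SAE feature
# and a clinical variable: regress age out of both, then correlate residuals.
# Generic recipe; the paper's exact procedure may differ.
import numpy as np
from scipy import stats


def residualize(values: np.ndarray, age: np.ndarray) -> np.ndarray:
    """Return residuals of `values` after a least-squares fit on age."""
    design = np.column_stack([np.ones_like(age), age])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


def age_partial_corr(feature: np.ndarray, clinical: np.ndarray, age: np.ndarray):
    """Pearson correlation between feature and clinical variable with age removed."""
    return stats.pearsonr(residualize(feature, age), residualize(clinical, age))


# Example with synthetic data: both variables are partly driven by age.
rng = np.random.default_rng(0)
age = rng.uniform(55, 90, size=500)
feature = 0.05 * age + rng.normal(size=500)   # age-confounded SAE feature
clinical = 0.04 * age + rng.normal(size=500)  # age-confounded clinical score

r_raw, _ = stats.pearsonr(feature, clinical)
r_partial, _ = age_partial_corr(feature, clinical, age)
print(f"raw r = {r_raw:.2f}, age-deconfounded r = {r_partial:.2f}")
```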
Validation used approximately 14,000 T1-weighted MRI scans from two cohorts: the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging Biomarkers and Lifestyle (AIBL) study. GeoSAE identified features that predict conversion from mild cognitive impairment to Alzheimer's disease with an AUC of 0.746 using just 2% of the model's embedding dimensions. Features annotated with comorbidities performed at chance level on the conversion task, a negative-control result indicating that the deconfounded annotations capture disease-specific signal rather than noise. Cross-cohort feature replication reached r=0.97 without retraining, and the identified features localized to neuroanatomically distinct regions consistent with Braak staging, the established neuropathological progression framework for Alzheimer's disease.
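To make the 2%-of-dimensions claim concrete, the sketch below shows one plausible way such an evaluation could be run: select a small fraction of feature activations, fit a simple classifier for the conversion label, and score it by AUC. The data is synthetic and the selection rule is an illustrative assumption, not the paper's protocol.

```python
# Sketch: test whether a small subset of interpretable features (~2% of
# dimensions) recovers predictive signal for a binary conversion label.
# Synthetic data and a simple selection rule stand in for the paper's protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_subjects, n_features = 1000, 4096
X = rng.normal(size=(n_subjects, n_features))  # SAE feature activations
informative = rng.choice(n_features, size=80, replace=False)
logits = X[:, informative] @ rng.normal(size=80) * 0.3
y = (rng.uniform(size=n_subjects) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Keep the top ~2% of features by absolute correlation with the label.
corrs = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(n_features)])
top = np.argsort(corrs)[-int(0.02 * n_features):]

clf = LogisticRegression(max_iter=1000).fit(X_tr[:, top], y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, top])[:, 1])
print(f"AUC with {len(top)} of {n_features} features: {auc:.3f}")
```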
For healthcare AI teams in production, the implication is direct. Most deployments treat the encoder as a black box and attach a task-specific head. GeoSAE provides a post-hoc interpretability layer that operates on any frozen encoder, compatible with existing inference pipelines without model retraining. Explainability is increasingly expected, and in some indications required, at the feature level for FDA-regulated software-as-a-medical-device workflows.
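In pipeline terms, a post-hoc interpretability layer of this kind could sit beside the existing encoder and task head without touching either, as in the hypothetical sketch below. The function and module names (predict_with_audit, task_head, feature_names) are placeholders, and the SAE is assumed to return reconstructions and feature activations as in the earlier sketch.

```python
# Sketch of a post-hoc interpretability layer sitting beside a frozen encoder.
# The encoder and task head are untouched; the SAE only reads embeddings.
# All module and function names are hypothetical placeholders.
import torch
import torch.nn as nn


@torch.no_grad()
def predict_with_audit(scan: torch.Tensor,
                       encoder: nn.Module,      # frozen foundation model
                       task_head: nn.Module,    # existing clinical head
                       sae: nn.Module,          # pretrained interpretability layer
                       feature_names: list[str],
                       top_k: int = 5):
    embedding = encoder(scan)            # same call the pipeline already makes
    prediction = task_head(embedding)    # unchanged clinical output
    _, features = sae(embedding)         # interpretable feature activations
    top = torch.topk(features.squeeze(0), k=top_k)
    audit = {feature_names[i]: float(v)
             for i, v in zip(top.indices.tolist(), top.values)}
    return prediction, audit
```

Because the interpretability layer only reads embeddings, it can be versioned and updated independently of the regulated prediction path.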
The 2% dimensionality finding matters for monitoring system design. If a compact, human-interpretable feature subset recovers most of the clinically relevant signal, there is a credible path to lightweight audit dashboards that flag model drift or distribution shift in terms clinicians can review. Cross-cohort stability at r=0.97 suggests those dashboards would not need rebuilding when the underlying data comes from a different institution.
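As one way to picture such a dashboard, the sketch below compares recent activations of a compact, named feature set against a reference cohort and flags features whose distribution has shifted, using a per-feature two-sample Kolmogorov-Smirnov test. The test choice, threshold, and feature names are illustrative assumptions, not anything specified in the paper.

```python
# Sketch: flag distribution shift on a compact, clinician-reviewable feature set
# by comparing recent activations to a reference cohort, feature by feature.
# The KS test and threshold are illustrative choices, not the paper's method.
import numpy as np
from scipy import stats


def drift_report(reference: np.ndarray,   # [n_ref, n_features] reference activations
                 recent: np.ndarray,      # [n_recent, n_features] recent activations
                 feature_names: list[str],
                 alpha: float = 0.01) -> list[dict]:
    flagged = []
    for j, name in enumerate(feature_names):
        stat, p_value = stats.ks_2samp(reference[:, j], recent[:, j])
        if p_value < alpha:
            flagged.append({"feature": name,
                            "ks_stat": round(float(stat), 3),
                            "p_value": float(p_value)})
    return flagged


# Example: one feature ("hippocampal atrophy") drifts upward in recent scans.
rng = np.random.default_rng(1)
names = ["hippocampal atrophy", "ventricular enlargement", "cortical thinning"]
ref = rng.normal(size=(500, 3))
new = rng.normal(size=(100, 3))
new[:, 0] += 0.8  # simulated distribution shift
print(drift_report(ref, new, names))
```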
Open questions remain. The paper's scope is T1-weighted structural MRI for a single disease progression task; generalization to multimodal or dynamic imaging is untested. The AUC of 0.746 for conversion prediction is clinically meaningful but not deployment-grade on its own; it would need to be integrated into a broader diagnostic pipeline. Age-deconfounding was designed for a specific confound; applying the approach to disease areas with less well-characterized dominant confounders will require additional design work.
For healthcare AI infrastructure investment, interpretability tooling has moved beyond toy language model circuits. GeoSAE demonstrates that mechanistic analysis techniques can be productized for frozen clinical encoders at real dataset scale and produce outputs that map onto established medical ontologies without clinical supervision during SAE training.