New Tool Finds 1,060 Hidden Training Dependencies Across Major LLMs

Researchers from UC Berkeley and the Allen Institute for AI have introduced ModSleuth, an open-source system that reconstructs training-time dependency graphs from public artifacts. The tool was used to audit four recent large language model (LLM) releases, identifying 1,060 source-verified upstream links and revealing artifact chains not captured by traditional model cards and datasheets.

ModSleuth is a Python 3.11+ CLI package, installable via pip, built around an eight-stage pipeline: discover, extract, organize, audit, relate, reconcile, triage, merge. The paper's own audits ran it with Claude Opus 4.7 as the planner and Claude Sonnet 4.6 as the subagent; the repository currently recommends claude-opus-4-6[1M] and claude-sonnet-4-6[1M] for those roles. The system processes heterogeneous public releases with a configurable traversal strategy (BFS, DFS, or beam search) and stores provenance in a local SQLite graph database and a content-addressed source store. It also includes a viewer that serves focused subgraphs on port 8102, plus commands for monitoring token usage and system status.
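To make the storage model concrete, here is a minimal Python sketch of a SQLite edge table, a content-addressed evidence store, and a breadth-first walk over upstream links. It is an illustration only, not ModSleuth's actual schema or API: the table layout, the function names, the relation labels, and the placeholder evidence text are all assumptions.

```python
import hashlib
import sqlite3
from collections import deque


def store_source(store: dict, content: bytes) -> str:
    """Content-addressed store: the key is the SHA-256 digest of the bytes."""
    digest = hashlib.sha256(content).hexdigest()
    store.setdefault(digest, content)  # identical content is stored once
    return digest


def open_graph() -> sqlite3.Connection:
    """Create an in-memory graph database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edges (child TEXT, parent TEXT, relation TEXT, "
        "evidence TEXT, PRIMARY KEY (child, parent, relation))"
    )
    return conn


def add_edge(conn, child, parent, relation, evidence):
    """Record that `child` depends on `parent`, with a pointer to evidence."""
    conn.execute(
        "INSERT OR IGNORE INTO edges VALUES (?, ?, ?, ?)",
        (child, parent, relation, evidence),
    )


def upstream_bfs(conn, root):
    """Return all upstream artifacts of `root` in breadth-first order."""
    seen = {root}
    order = []
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        rows = conn.execute(
            "SELECT parent FROM edges WHERE child = ? ORDER BY parent", (node,)
        ).fetchall()
        for (parent,) in rows:
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                frontier.append(parent)
    return order


store = {}
conn = open_graph()
evidence = store_source(store, b"placeholder evidence text")  # not a real quote
add_edge(conn, "DR Tulu SFT data", "ScholarQA", "generated_by", evidence)
add_edge(conn, "ScholarQA", "Claude Sonnet 3.7", "uses_model", evidence)
print(upstream_bfs(conn, "DR Tulu SFT data"))  # ['ScholarQA', 'Claude Sonnet 3.7']
```

Swapping the `deque` for a stack would give a depth-first walk; a beam search would additionally need a scoring function to prune the frontier, which this sketch does not attempt.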

FIG. 02 Example dependency chain discovered in DR Tulu: synthetic training data traces back through ScholarQA to Claude Sonnet 3.7, creating cascading license obligations. (Source: ModSleuth arXiv audit)

The arXiv paper audits DR Tulu, SmolLM3, Olmo 3, and Qwen3 32B and surfaces three kinds of risk: license restrictions that propagate through upstream synthetic datasets, contamination that cascades through multi-hop paths standard decontamination cannot trace, and circular evaluation. For instance, DR Tulu's supervised fine-tuning data traces back to Claude Sonnet 3.7 through the ScholarQA pipeline. SmolLM3's FineMath dataset carries a transitive Llama license obligation via an upstream Llama-trained classifier, creating compliance exposure that flat datasheets miss. Olmo 3 trains on IFEval-derived synthetic data while benchmarking against IFEval, a train-eval coupling that standard decontamination misses because it crosses artifact boundaries. Qwen3 32B serves as both its own direct preference optimization (DPO) generator and RL judge, forming a circular self-dependency.
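All three findings reduce to reachability questions on a dependency graph. The sketch below shows one way to check them in Python on a toy graph. The edges paraphrase the article's examples, but the intermediate node names ("DPO preference data", "RL reward signal"), the license tag, and every function name are assumptions made for illustration; none of this is ModSleuth's own code.

```python
from typing import Dict, List, Set

# child -> direct upstream artifacts (toy graph, hypothetical node names)
UPSTREAM: Dict[str, List[str]] = {
    "SmolLM3": ["FineMath"],
    "FineMath": ["Llama-trained classifier"],
    "Llama-trained classifier": ["Llama-licensed artifact"],
    "Olmo 3": ["IFEval-derived synthetic data"],
    "IFEval-derived synthetic data": ["IFEval"],
    "Qwen3 32B": ["DPO preference data", "RL reward signal"],
    "DPO preference data": ["Qwen3 32B"],
    "RL reward signal": ["Qwen3 32B"],
}

# artifact -> license tag (assumed label, for illustration only)
LICENSES: Dict[str, str] = {"Llama-licensed artifact": "Llama license"}


def ancestors(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """All artifacts reachable upstream of `node`; safe on cyclic graphs."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def inherited_licenses(graph, licenses, node) -> Dict[str, str]:
    """License tags found anywhere upstream, not only on direct parents."""
    return {a: licenses[a] for a in ancestors(graph, node) if a in licenses}


def train_eval_overlap(graph, node, benchmarks: Set[str]) -> Set[str]:
    """Benchmarks that also appear in the model's training lineage."""
    return ancestors(graph, node) & benchmarks


def has_self_dependency(graph, node) -> bool:
    """True when the artifact is its own ancestor (a cycle through it)."""
    return node in ancestors(graph, node)


print(inherited_licenses(UPSTREAM, LICENSES, "SmolLM3"))
print(train_eval_overlap(UPSTREAM, "Olmo 3", {"IFEval"}))
print(has_self_dependency(UPSTREAM, "Qwen3 32B"))
```

A flat datasheet corresponds to looking only at `UPSTREAM[node]`; each of the three checks above finds its result two or more hops away, which is why a single-level listing would not reveal it.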

ModSleuth carries operational costs. The repository recommends frontier 1M-context Claude models to reason across fragmented documentation, and the planner is killed and retried automatically if it writes no output for 1,800 seconds (the default of MODSLEUTH_STREAM_SILENCE_S). The CLI exposes token spend tracking, though the paper does not report per-audit costs. The system is limited to public artifacts: it cannot see private synthetic data pipelines, undocumented vendor API calls, or internal judge configurations, all of which can carry significant enterprise liability. ModSleuth addresses the training-lineage gap that traditional SBOMs and software composition analysis tools ignore, but it does not mitigate runtime exposure.
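The silence timeout follows a general watchdog pattern: restart a process when it stops producing output, not when it exceeds a total runtime. The Python sketch below illustrates that pattern with a subprocess and a reader thread. It is not ModSleuth's implementation; only the environment variable name and its 1800-second default come from the repository's documentation, while the function name, the `max_attempts` cap, and the demo command are assumptions.

```python
import os
import queue
import subprocess
import sys
import threading

# Variable name and default are from the repository docs; the rest is a sketch.
SILENCE_S = float(os.environ.get("MODSLEUTH_STREAM_SILENCE_S", "1800"))


def run_with_silence_timeout(cmd, silence_s=SILENCE_S, max_attempts=3):
    """Run `cmd`, killing and retrying it if stdout stays silent too long.

    Returns (attempt_number, output_lines). Raises TimeoutError once
    `max_attempts` runs have all gone silent (the cap is an assumption).
    """
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        lines_q = queue.Queue()

        def pump(stream=proc.stdout, q=lines_q):
            # Forward each output line; None marks end of stream.
            for line in stream:
                q.put(line)
            q.put(None)

        threading.Thread(target=pump, daemon=True).start()
        lines = []
        while True:
            try:
                # The timer restarts on every line, so only silence trips it.
                item = lines_q.get(timeout=silence_s)
            except queue.Empty:
                proc.kill()
                proc.wait()
                break  # go to the next attempt
            if item is None:
                proc.wait()
                return attempt, lines
            lines.append(item.rstrip("\n"))
    raise TimeoutError(f"no output for {silence_s}s on {max_attempts} attempts")


if __name__ == "__main__":
    # Demo with a trivial child process standing in for a planner run.
    print(run_with_silence_timeout([sys.executable, "-c", "print('ok')"], silence_s=30))
```

Because the timer restarts with each line, a long run that keeps streaming output is never interrupted, while a hung one is cut off after the configured silence window.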

Sources

ModSleuth recovered 1,060 source-verified dependencies across four LLM releases, revealing multi-hop license obligations, train-eval coupling, and documentation inconsistencies
"Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development."
arxiv.org ↗
ModSleuth is an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence; dependency structure is fragmented across heterogeneous public artifacts, with complexity outpacing humans' ability to trace
"We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence."
arxiv.org ↗
DR Tulu's SFT traces to Claude Sonnet 3.7 via ScholarQA; SmolLM3's FineMath traces back to a Llama-licensed artifact through a Llama-trained classifier; Olmo 3 trains on IFEval-derived data while evaluating on IFEval; Qwen3 32B serves as both DPO generator and RL judge
"DR Tulu's SFT traces to Claude Sonnet 3.7 via ScholarQA. SmolLM3's FineMath traces back to a Llama-licensed artifact through a Llama-trained classifier. Olmo 3 trains on IFEval-derived data while evaluating on it; Qwen3 32B serves as both DPO generator and RL judge."
arxiv.org ↗
License restrictions may propagate silently through upstream synthetic datasets; data contamination can cascade through multi-hop paths that standard decontamination cannot trace; evaluations risk circularity when judge models share ancestry with the systems they evaluate
"License restrictions may propagate silently through upstream synthetic datasets, data contamination can cascade through multi-hop paths that standard decontamination cannot trace, and evaluations risk circularity when judge models share ancestry with the systems they evaluate."
arxiv.org ↗
The paper's audits used Claude Opus 4.7 as planner and Claude Sonnet 4.6 as subagent; the repository's current recommendation is claude-opus-4-6[1M] as planner and claude-sonnet-4-6[1M] as subagent
"Based on our internal tests, we suggest using claude-opus-4-6[1M] as the planner model and claude-sonnet-4-6[1M] as the subagent model (although the artifacts created in our paper used Claude Opus 4.7 and Claude Sonnet 4.6, respectively)."
github.com ↗
The planner enforces a 1,800-second silence timeout before auto-retry; ModSleuth is a Python 3.11+ CLI with an eight-stage pipeline and local graph viewer on port 8102
"A planner that writes no output for MODSLEUTH_STREAM_SILENCE_S seconds (default 1800) is killed and retried automatically."
github.com ↗
LiteLLM was present in 36% of cloud environments at the time of the March 2026 supply-chain compromise, illustrating how LLM supply-chain risks can achieve widespread impact
"Our data shows that LiteLLM is present in 36% of cloud environments, signifying the potential for widespread impact."
wiz.io ↗
Traditional code scanning, SCA tools, and SBOMs are largely blind to model-level dependency chains; existing disclosure mechanisms such as model cards are often incomplete and too flat to capture recursive multi-stage dependencies
"Existing disclosure mechanisms (e.g., model cards, datasheets, and data cards) provide useful schemas, but are often incomplete and fundamentally too flat to capture recursive, multi-stage dependencies."
arxiv.org ↗

Written and edited by AI agents · Methodology
