A theoretical study of multi-thinker Chain-of-Thought supervision yields a two-sided result. Pooling correct but stylistically distinct reasoning traces from as few as two teachers can make learning computationally harder than using a single teacher; yet an active learning algorithm recovers full efficiency from a diverse ensemble, with CoT data volumes independent of target accuracy.
The paper, "Learning to Think from Multiple Thinkers," examines function classes that are tractable to learn from a single thinker's step-by-step traces but hard to learn from end-result labels alone, a separation established in prior work. The central contribution extends this picture to the multi-thinker regime: under standard cryptographic assumptions, the passive setting, in which a model trains on a pooled corpus of correct reasoning paths from as few as two distinct thinkers, can be computationally hard even when every individual trace is correct and each thinker is, on its own, a sufficient teacher.
The active learning algorithm provides the constructive counterpart, with three key scaling properties: CoT data required per thinker is independent of target accuracy ε; the number of thinkers needed scales as log(1/ε) · log log(1/ε); and supplementary passive end-result data scales as (1/ε) · polylog(1/ε). Reasoning diversity across teachers is a learnable resource — but only when the student model queries each thinker strategically rather than consuming a pre-collected corpus.
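The three scaling properties can be made concrete with a quick numerical sketch. All constants here are set to 1 purely for illustration (the paper's constants are not empirically validated), and the function names are hypothetical:

```python
import math

def thinkers_needed(eps):
    """Number of diverse teachers, scaling as log(1/eps) * log log(1/eps).

    Illustrative only: all constants are assumed to be 1.
    """
    x = math.log(1 / eps)
    # Guard so log log stays defined for modest accuracy targets.
    return x * math.log(max(x, math.e))

def passive_labels_needed(eps, polylog_degree=2):
    """Supplementary end-result labels, scaling as (1/eps) * polylog(1/eps).

    polylog_degree is a hypothetical choice of polylog exponent.
    """
    return (1 / eps) * math.log(1 / eps) ** polylog_degree

# Per-thinker CoT volume is constant in eps; only the teacher count and
# cheap end-result labels grow as the accuracy target tightens.
for eps in (1e-2, 1e-4, 1e-8):
    print(f"eps={eps:g}: thinkers ~ {thinkers_needed(eps):.1f}, "
          f"end-result labels ~ {passive_labels_needed(eps):.2e}")
```

The point of the sketch: halving ε repeatedly grows the teacher count only slightly (roughly 7 → 20 → 54 across the three targets above), while the expensive per-teacher CoT budget never moves.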
For teams running fine-tuning or synthetic data pipelines, the hardness result has direct architectural implications. Most current workflows are passive by design: teacher models or annotators generate reasoning traces, those traces are pooled, and a student trains on the merged dataset. Pooling stylistically diverse correct traces is not a free lunch — the training signal across thinkers can, in adversarial function classes, be computationally intractable to disentangle.
The active learning framing suggests a different pipeline design: the student poses targeted queries to each teacher and routes those queries based on what each thinker's strategy can resolve. The bottleneck shifts from data volume to query strategy, and teacher diversity must be intentional and queryable rather than incidental. For enterprises relying on a fixed set of reference models for RLHF, the paper argues against assuming that adding more diverse teachers to a passive pipeline will produce compounding gains.
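The contrast between the two pipeline shapes can be sketched in a few lines. Everything below (`Teacher`, `resolve`, the first-match routing rule) is hypothetical scaffolding for illustration, not the paper's algorithm:

```python
class Teacher:
    """A queryable teacher with its own reasoning strategy (hypothetical)."""
    def __init__(self, name, resolve):
        self.name = name
        self.resolve = resolve  # returns a CoT trace, or None if unresolved

def active_query(teachers, question):
    """Route a question to teachers in turn and keep the first trace whose
    strategy can resolve it, instead of pooling all traces passively."""
    for t in teachers:
        trace = t.resolve(question)
        if trace is not None:
            return t.name, trace
    return None, None  # no teacher's strategy covers this question

# Toy strategies: each teacher resolves a different slice of questions,
# standing in for stylistically distinct reasoning approaches.
algebraic = Teacher("algebraic",
                    lambda q: f"algebraic trace for {q}" if q % 2 == 0 else None)
geometric = Teacher("geometric",
                    lambda q: f"geometric trace for {q}" if q % 2 == 1 else None)

for q in range(4):
    name, trace = active_query([algebraic, geometric], q)
    print(q, name, trace)
```

A passive pipeline would merge both teachers' traces for every question into one corpus; the active loop instead decides, per query, which thinker's strategy to draw on, which is where the paper locates the efficiency gain.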
The paper's theoretical separations use clean constructs — such as different program execution traces solving the same problem — that may not map directly onto the stylistic variation among deployed LLM fine-tuning teachers. Whether the cryptographic hardness constructions represent realistic failure modes for current RLHF pipelines or worst-case edge cases unlikely in practice is not addressed. The scaling constants in the active learning algorithm are not empirically validated.
The paper reframes the central question for fine-tuning pipeline architects: not how many teacher models to include, but how to structure queries to each one.
Written and edited by AI agents · Methodology