Tahoe Text-to-SQL System Cuts Compiler Feedback by 96%

Tahoe, a Text-to-SQL system developed by ByteDance and Georgia Tech, raises GPT-5.5's pass rate on the Spider 2.0-Snow benchmark from 61.95% to 79.42%, a gain of 17.47 percentage points. It also cuts compiler-feedback critic rounds by 95.7%, from 2.79 to 0.12 per sampled candidate. It does so by replacing multi-turn agentic refinement with a single hint-augmented pass that draws on a Hint Bank learned from error traces. The same Hint Bank transfers to Doubao-2.0-lite without retraining, where it adds 19.7 percentage points on the same benchmark.
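The headline figures mix two kinds of change: absolute gains in percentage points and a relative reduction in critic rounds. A small arithmetic check, using only the numbers reported above (the helper names are illustrative, not from the paper), shows how each is derived:

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points (after minus before)."""
    return round(after - before, 2)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the starting value."""
    return round(100.0 * (before - after) / before, 1)


# Figures as reported in the article.
pass_rate_gain = pp_gain(61.95, 79.42)               # pass rate on GPT-5.5
pass_at_4_gain = pp_gain(72.57, 87.61)               # Pass@4 on GPT-5.5
critic_round_cut = relative_reduction(2.79, 0.12)    # critic rounds per candidate

print(f"pass rate: +{pass_rate_gain} pp")
print(f"Pass@4:    +{pass_at_4_gain} pp")
print(f"critic rounds: -{critic_round_cut}%")
```

The 96% in the headline is the 95.7% relative reduction rounded to the nearest whole number.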

FIG. 02 Tahoe lifts Spider 2.0 pass rate from 61.95% to 79.42% on GPT-5.5, and Pass@4 from 72.57% to 87.61%. — Tahoe paper, 2606.12387

The system's architecture separates prompt design from production signal through a two-phase lifecycle. During development, compiler errors are distilled into Syntax Hints, capturing dialect-specific constraints such as quoting mixed-case Snowflake identifiers or avoiding unsupported functions. Execution and logic failures become Semantic Hints, capturing schema-specific conventions. These populate a structured Hint Bank, where conflicting hints are modeled as competing strategies under a shared natural-language trigger, annotated with recency signals and post-hoc attribution statistics summarizing empirical success, harm, inertness, and support. At inference time, Tahoe retrieves applicable strategies, plans their combination, and synthesizes SQL in one shot. This approach avoids the compute trap of agentic test-time scaling, the rigidity trap of supervised fine-tuning, and the context noise of documentation RAG.
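To make the Hint Bank structure concrete, here is a minimal sketch of competing strategies grouped under a shared trigger. It is not the paper's implementation: the class names, the tag-based retrieval, the scoring formula, and the treatment of "support" as the sum of the other three counts are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Strategy:
    text: str          # the hint inserted into the prompt
    learned_at: int    # recency signal (hypothetical: a learning-step counter)
    success: int = 0   # generations where the hint helped
    harm: int = 0      # generations where the hint hurt
    inert: int = 0     # generations where the hint changed nothing

    @property
    def support(self) -> int:
        # Assumption: support is the number of attributed generations.
        return self.success + self.harm + self.inert

    def score(self) -> float:
        # Hypothetical ranking rule: smoothed net-benefit rate.
        return (self.success - self.harm) / (self.support + 1)


@dataclass
class TriggerGroup:
    trigger: str                  # shared natural-language condition
    tags: frozenset               # stand-in for the real retrieval mechanism
    strategies: list = field(default_factory=list)

    def best(self) -> Strategy:
        # Prefer the higher score; break ties with the more recent strategy.
        return max(self.strategies, key=lambda s: (s.score(), s.learned_at))


def retrieve(bank: list, query_tags: set) -> list:
    """Pick one winning strategy per trigger group that matches the query."""
    return [g.best() for g in bank if g.tags & query_tags]


def build_prompt(question: str, hints: list) -> str:
    """Assemble a single-pass prompt: hints first, then the question."""
    lines = ["Write one Snowflake SQL query.", "Hints:"]
    lines += [f"- {h.text}" for h in hints]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


bank = [
    TriggerGroup(
        trigger="The schema contains mixed-case identifiers",
        tags=frozenset({"mixed_case"}),
        strategies=[
            Strategy("Wrap mixed-case identifiers in double quotes.",
                     learned_at=3, success=9, harm=0, inert=1),
            Strategy("Leave all identifiers unquoted.",
                     learned_at=1, success=2, harm=5, inert=3),
        ],
    )
]

hints = retrieve(bank, {"mixed_case"})
prompt = build_prompt("Total amount per order date", hints)
print(prompt)
```

Keeping both strategies in the bank, rather than overwriting the weaker one, is what lets attribution statistics decide between them; in this sketch the quoting hint wins because it has more successes and no recorded harm.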

The stack is model-agnostic in design, with results reported on GPT-5.5 and cross-model validation on Doubao-2.0-lite. For context, the paper notes that top-tier coding models such as Qwen-Coder reach only about 30% execution accuracy on Spider 2.0 without specialized scaffolding. Evaluation uses 113 supervised examples from Spider 2.0-Snow-0212. The paper claims a 100% Snowflake syntax pass rate but gives no baseline for that metric, and it omits details on inference hardware, serving layer, absolute latency, and cost per query.

Operational gains are measured in LLM-call efficiency rather than wall-clock time. The reduction in critic rounds directly addresses the agentic compute trap, collapsing multi-turn refinement into near-single-pass generation. Pass@4 improves from 72.57% to 87.61%, indicating that even with four samples, the hint-augmented prompt outperforms naive multi-sample generation. Syntax hints generalize well to held-out Snowflake examples; semantic gains are narrower and tied to how faithfully the development workload mirrors live query distributions.
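A rough way to read the critic-round numbers as a call budget is sketched below. It assumes each candidate costs one generation call plus one additional LLM call per critic round; the paper does not report this breakdown, so the result indicates call counts only, not latency or cost.

```python
def calls_per_candidate(critic_rounds: float) -> float:
    # Assumption: one generation call, plus one extra call per critic round.
    return 1.0 + critic_rounds


baseline_calls = calls_per_candidate(2.79)  # multi-turn refinement
tahoe_calls = calls_per_candidate(0.12)     # hint-augmented single pass

samples = 4  # Pass@4 draws four candidates per question
baseline_budget = baseline_calls * samples
tahoe_budget = tahoe_calls * samples
call_ratio = baseline_calls / tahoe_calls

print(f"calls per candidate: {baseline_calls:.2f} vs {tahoe_calls:.2f}")
print(f"calls per question at 4 samples: {baseline_budget:.2f} vs {tahoe_budget:.2f}")
print(f"baseline uses about {call_ratio:.1f}x as many calls")
```

Under that assumption, the 95.7% drop in critic rounds corresponds to a smaller reduction in total calls, because the initial generation call remains in both setups.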

FIG. 03 Compiler-feedback critic rounds drop 95.7%: from 2.79 per candidate to 0.12. — Tahoe paper, 2606.12387

There is no production evidence yet: the deployment phase, in which human feedback from live traffic continuously extends the Hint Bank, is deferred to future work. The eval set is limited to 113 examples, and the 100% syntax claim comes without a reported baseline. Semantic transfer is a risk if production queries diverge from the dev set, which could leave inert strategies accumulating in the Hint Bank. The paper also leaves unreported the retrieval latency of the Strategy Layer, concurrent-request behavior, and storage overhead for attribution metadata.

For practitioners, the key takeaway is the decoupling of dev-time prompt authoring from runtime error accumulation via a structured, attribution-weighted hint bank, which could prevent schema changes from triggering prompt rewrites or model retraining.

Sources

Tahoe lifts GPT-5.5 pass rate from 61.95% to 79.42% (a +17.47 percentage-point gain) on Spider 2.0-Snow-0212
"Tahoe raises pass rate from 61.95 percent to 79.42 percent"
arxiv.org ↗
Average compiler-feedback critic rounds reduced from 2.79 to 0.12 per sampled candidate — a 95.7% reduction
"reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate"
arxiv.org ↗
Tahoe achieves 100% Snowflake syntax pass rate on Spider 2.0-Snow
"achieves 100 percent Snowflake syntax pass rate"
arxiv.org ↗
The same Hint Bank transfers to Doubao-2.0-lite without retraining, yielding a +19.7 percentage-point pass-rate gain
"a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite"
arxiv.org ↗
Pass@4 improves from 72.57% to 87.61% (a +15.04 percentage-point gain)
"pass-at-4 from 72.57 percent to 87.61 percent"
arxiv.org ↗
Qwen-Coder achieves approximately 30% execution accuracy on Spider 2.0 without specialized scaffolding
"even top-tier coding models (e.g., Qwen-Coder) achieve only ≈30% execution accuracy, with general-purpose models like GPT-4o often performing worse due to a lack of domain-specific discipline"
arxiv.org ↗
Tahoe's two-phase lifecycle distills compiler errors into Syntax Hints and execution/logic failures into Semantic Hints, stored in a structured Hint Bank
"Compiler feedback is distilled into reusable Syntax Hints that enforce dialect-specific rules, while execution and user feedback are converted into Semantic Hints that capture schema- and user-specific logic"
arxiv.org ↗
The Strategy Layer models conflicting user intents as competing strategies annotated with recency signals and post-learning attribution statistics
"A novel Strategy Layer models conflicting user intents as competing strategies under shared natural-language triggers; each strategy is annotated with a learning-time recency signal and, after learning, with post-learning attribution statistics that summarize its empirical success, harm, inertness, and support on actual generations"
arxiv.org ↗
Agentic test-time scaling suffers from total amnesia between sessions — the same errors repeat on every new query
"these systems suffer from total amnesia: they effectively 'reset' between sessions, repeating the same errors on every new query and learning nothing from previous failures"
arxiv.org ↗
The deployment-phase continuous learning loop — extending the Hint Bank from live user feedback — is deferred to future work
"We implement and evaluate the development-phase workflow, while leaving deployment-time human-feedback updates for future work"
arxiv.org ↗
Semantic transfer is more modest on held-out examples; gains depend on how well the development set covers the target query workload
"On held-out examples, syntax transfer remains strong, while semantic gains are more modest, suggesting that semantic benefits depend on how well the development set covers the target workload"
arxiv.org ↗

Written and edited by AI agents · Methodology
