TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid, and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL accuracy without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.
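
To make the Hint Bank and Strategy Layer described above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. Every name in it (`HintKind`, `Attribution`, `Strategy`, `Hint`, `HintBank`, `retrieve`) is hypothetical, as are the word-overlap retrieval, the definition of support as the total number of observations, and the scoring formula; the abstract does not specify any of these details.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class HintKind(Enum):
    SYNTAX = "syntax"      # distilled from compiler feedback (dialect rules)
    SEMANTIC = "semantic"  # distilled from execution or user feedback


@dataclass
class Attribution:
    """Hypothetical post-learning counters for one strategy."""
    success: int = 0  # applications where the strategy helped
    harm: int = 0     # applications where the strategy hurt
    inert: int = 0    # applications where the strategy made no difference

    @property
    def support(self) -> int:
        # Assumption: support is the total number of observed applications.
        return self.success + self.harm + self.inert

    def score(self) -> float:
        # Assumption: smoothed net benefit; the real statistic may differ.
        return (self.success - self.harm) / (self.support + 1)


@dataclass
class Strategy:
    text: str
    last_used: int  # logical timestamp serving as the recency signal
    stats: Attribution = field(default_factory=Attribution)


@dataclass
class Hint:
    kind: HintKind
    trigger: str                # natural-language trigger shared by strategies
    strategies: List[Strategy]  # competing strategies under the same trigger

    def best_strategy(self) -> Strategy:
        # Prefer the higher attribution score; break ties by recency.
        return max(self.strategies, key=lambda s: (s.stats.score(), s.last_used))


class HintBank:
    def __init__(self) -> None:
        self.hints: List[Hint] = []

    def add(self, hint: Hint) -> None:
        self.hints.append(hint)

    def retrieve(self, question: str, k: int = 3) -> List[str]:
        """Return the best strategy text of up to k hints whose trigger
        shares at least one word with the question (toy retrieval)."""
        q_words = set(question.lower().split())
        scored = []
        for hint in self.hints:
            overlap = len(q_words & set(hint.trigger.lower().split()))
            if overlap > 0:
                scored.append((overlap, hint))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [hint.best_strategy().text for _, hint in scored[:k]]


bank = HintBank()
bank.add(Hint(
    HintKind.SYNTAX,
    "quote mixed case column names",
    [Strategy("Wrap case-sensitive identifiers in double quotes.", last_used=1,
              stats=Attribution(success=5, harm=0, inert=1))],
))
bank.add(Hint(
    HintKind.SEMANTIC,
    "monthly revenue",
    [Strategy("Group revenue by calendar month.", last_used=2,
              stats=Attribution(success=4, harm=1, inert=0)),
     Strategy("Group revenue by fiscal month.", last_used=7,
              stats=Attribution(success=1, harm=3, inert=1))],
))

hints = bank.retrieve("show monthly revenue by region")
print(hints)
```

In this toy setup the two strategies under the "monthly revenue" trigger stand in for conflicting user intents. The retrieved hint text would then be placed in the prompt ahead of the Logic Planning and SQL Synthesis steps; a real system would presumably use a stronger retriever than word overlap.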
