Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods obtain diversity heuristically, through stochastic decoding or prompt variants, which leaves candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on the examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 points on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer across dialects (Snowflake, BigQuery, SQLite) without retraining, and also to a different task formulation, such as BIRD-Critic (+2.6 points). Error diagnostics show up to 3× fewer hallucinated schema references and function calls, indicating that the gains come from genuinely reliable complementary skills rather than surface-form variation.
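To make the residual idea concrete, the following is a minimal sketch, not the DivSkill-SQL implementation. It assumes each skill can be summarized by a boolean correctness vector over a fixed set of examples, and it replaces the actual skill optimization step (which the abstract does not specify) with a hypothetical stand-in. All names (`pass_at_k`, `residual_indices`, `build_ensemble`, `toy_optimizer`) are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Empirical Pass@K: fraction of examples solved by at least one skill.

    correct[s][i] is True if skill s produces a correct query for example i.
    """
    if not correct or len(correct[0]) == 0:
        return 0.0
    n = len(correct[0])
    solved = sum(any(row[i] for row in correct) for i in range(n))
    return solved / n


def residual_indices(correct: Sequence[Sequence[bool]], n_examples: int) -> List[int]:
    """Indices of examples that no skill in the current ensemble solves."""
    return [i for i in range(n_examples) if not any(row[i] for row in correct)]


def build_ensemble(
    n_examples: int,
    k: int,
    optimize_skill: Callable[[List[int]], List[bool]],
) -> List[List[bool]]:
    """Greedily add up to k skills, each optimized on the current residual set.

    optimize_skill is a hypothetical hook: given the residual example indices,
    it returns the new skill's correctness vector over all examples.
    """
    correct: List[List[bool]] = []
    for _ in range(k):
        residual = residual_indices(correct, n_examples)
        if not residual:  # every example is already covered
            break
        correct.append(optimize_skill(residual))
    return correct


def toy_optimizer(n_examples: int, per_skill: int = 2) -> Callable[[List[int]], List[bool]]:
    """Stand-in optimizer (assumption): solves the first `per_skill` residual examples."""
    def optimize(residual: List[int]) -> List[bool]:
        targets = set(residual[:per_skill])
        return [i in targets for i in range(n_examples)]
    return optimize


if __name__ == "__main__":
    complementary = build_ensemble(6, 3, toy_optimizer(6))
    # Three skills that all solve the same two examples (correlated failures).
    correlated = [[True, True, False, False, False, False]] * 3
    print(pass_at_k(complementary), pass_at_k(correlated))
```

In this toy setting, a new skill raises Pass@K only through the residual examples it solves, which is why optimizing on the residual set targets the skill's marginal contribution. Three skills that succeed on the same examples add nothing over one of them.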