Natural language interfaces to databases aim to translate user questions into executable SQL, yet they remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguities across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these neither scale nor resolve ambiguity autonomously. We propose SOMA-SQL, which automatically resolves ambiguity via targeted synthetic query logs and ambiguity-driven probing. SOMA-SQL first constructs a synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and disagreements among candidates, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without a human in the loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.
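To make the idea of disagreement-driven probing concrete, the following is a minimal sketch and not the SOMA-SQL implementation. It assumes a toy SQLite table `orders` and the question "How many orders are open?", where the value "open" could be grounded to either of two columns. All table, column, and function names (`orders`, `status`, `state`, `probe_columns_for_value`, `select_candidate`) are hypothetical and chosen only for illustration. When the candidate queries return different results, a probing query checks which column actually contains the value, and that evidence selects the candidate.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, state TEXT);
        INSERT INTO orders (status, state) VALUES
            ('shipped', 'CA'), ('open', 'CA'), ('open', 'NY');
        """
    )
    return conn


def run(conn, sql, params=()):
    """Execute a query and return its rows in a canonical (sorted) order."""
    return sorted(conn.execute(sql, params).fetchall())


def candidates_disagree(conn, candidates):
    """True if the candidate queries do not all return the same result."""
    results = {tuple(run(conn, sql)) for sql in candidates}
    return len(results) > 1


def probe_columns_for_value(conn, table, columns, value):
    """Probing query: list the columns in which `value` actually occurs."""
    hits = []
    for col in columns:
        # Identifiers are interpolated only because they come from a fixed,
        # trusted list in this sketch; the value itself is parameterized.
        sql = f"SELECT COUNT(*) FROM {table} WHERE {col} = ?"
        if conn.execute(sql, (value,)).fetchone()[0] > 0:
            hits.append(col)
    return hits


def select_candidate(conn, candidates_by_column, table, value):
    """Pick a candidate using probe evidence; None if still ambiguous."""
    candidates = list(candidates_by_column.values())
    if not candidates_disagree(conn, candidates):
        return candidates[0]  # all interpretations agree, nothing to resolve
    hits = probe_columns_for_value(conn, table, list(candidates_by_column), value)
    if len(hits) == 1:
        return candidates_by_column[hits[0]]
    return None  # evidence is inconclusive


# Two candidate interpretations of "How many orders are open?"
CANDIDATES = {
    "status": "SELECT COUNT(*) FROM orders WHERE status = 'open'",
    "state": "SELECT COUNT(*) FROM orders WHERE state = 'open'",
}

if __name__ == "__main__":
    conn = build_demo_db()
    chosen = select_candidate(conn, CANDIDATES, "orders", "open")
    print(chosen, run(conn, chosen))
```

In this sketch the probe is a simple value-containment check. The abstract describes probes driven by a structured ambiguity taxonomy, which would cover more cases than this single one.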