Large language models have been used to translate natural language questions into SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that cannot be executed. These failures limit the use of such systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. Evaluation on a subset of SQL grammars shows that all of our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate that this extension improves the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at https://github.com/ML-KULeuven/deepstochlog-lm.
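To make the idea of hard syntax and schema constraints concrete, the following is a minimal Python sketch of grammar-constrained decoding. It is not the framework's implementation, which uses unification-based definite clause grammars. The schema `SCHEMA`, the four-token grammar in `allowed_next`, and the stand-in scorer `toy_score` are all hypothetical and exist only for illustration. The check that the table must own the selected column plays the role that a shared logic variable would play in a definite clause grammar.

```python
from typing import Callable, Dict, List

# Hypothetical toy schema (an assumption for this sketch): table -> columns.
SCHEMA: Dict[str, List[str]] = {
    "singer": ["name", "age"],
    "concert": ["title", "year"],
}


def allowed_next(prefix: List[str]) -> List[str]:
    """Return the tokens that keep `prefix` inside the toy grammar
    query -> SELECT column FROM table, where table must own column."""
    if len(prefix) == 0:
        return ["SELECT"]
    if len(prefix) == 1:
        # Any column of any table is allowed; the table is not yet fixed.
        return sorted({c for cols in SCHEMA.values() for c in cols})
    if len(prefix) == 2:
        return ["FROM"]
    if len(prefix) == 3:
        # Schema constraint: only tables that contain the chosen column.
        return sorted(t for t, cols in SCHEMA.items() if prefix[1] in cols)
    return []  # the query is complete


def constrained_decode(score: Callable[[List[str], str], float]) -> List[str]:
    """Greedily pick the best-scoring token among the allowed ones."""
    prefix: List[str] = []
    while True:
        options = allowed_next(prefix)
        if not options:
            return prefix
        prefix.append(max(options, key=lambda tok: score(prefix, tok)))


def toy_score(prefix: List[str], token: str) -> float:
    # Stand-in for a language model: it likes "age" and, wrongly, "concert".
    return {"age": 1.0, "concert": 0.9}.get(token, 0.1)


query = constrained_decode(toy_score)
print(" ".join(query))  # SELECT age FROM singer
```

Left unconstrained, the stand-in scorer would pair `age` with `concert`, a table that has no such column. Because the candidate set is filtered before scoring, that invalid query can never be emitted, whatever scores the model assigns.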