Translating natural language to SQL for data retrieval has become more accessible thanks to code-generation LLMs. But how hard is it to generate SQL code? While databases can grow without bound in complexity, the complexity of queries is bounded by real-life utility and human needs. Using a sample of 376 databases, we show that SQL queries, as translations of natural language questions, are finite in practical complexity. We find no clear monotonic relationship between a database's table count and the complexity of the SQL queries written against it. In their template forms, SQL queries follow a power-law-like frequency distribution in which 70% of our tested queries are covered by just 13% of all template types, indicating that the large majority of SQL queries are predictable. This suggests that while LLMs for code generation can be useful, in the domain of database access they may be operating in a narrow, highly formulaic space where templates could be safer, cheaper, and more auditable.
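To make the coverage statistic concrete, the following is a minimal sketch of how a figure such as "70% of queries covered by 13% of template types" could be computed. The templating scheme here (masking string and numeric literals, normalizing case and whitespace) is an assumption made for illustration and is not necessarily the scheme used in this work; the function names `to_template` and `template_share_for_coverage` and the toy queries are likewise hypothetical.

```python
import re
from collections import Counter


def to_template(sql: str) -> str:
    """Reduce a query to a crude template (assumed scheme, for illustration only)."""
    s = re.sub(r"'[^']*'", "?", sql)           # mask string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)     # mask standalone numeric literals
    s = re.sub(r"\s+", " ", s).strip().rstrip(";")
    return s.upper()


def template_share_for_coverage(queries, target=0.70):
    """Smallest fraction of distinct templates whose queries reach `target` coverage."""
    counts = Counter(to_template(q) for q in queries)
    total = sum(counts.values())
    covered = 0
    # Walk templates from most to least frequent, accumulating covered queries.
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return k / len(counts)
    return 1.0


# Toy data (hypothetical): 10 queries that collapse into 5 templates.
toy_queries = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT name FROM users WHERE id = 2",
    "select name from users where id = 37;",
    "SELECT name FROM users WHERE id = 4",
    "SELECT COUNT(*) FROM orders",
    "SELECT COUNT(*) FROM orders",
    "SELECT city FROM users WHERE name = 'Ann'",
    "SELECT city FROM users WHERE name = 'Bob'",
    "SELECT a, b FROM t ORDER BY a",
    "SELECT MAX(price) FROM items",
]

print(template_share_for_coverage(toy_queries, target=0.70))  # prints 0.6
```

On the toy data, the three most frequent templates account for 8 of the 10 queries, so 60% of the template types are needed to reach 70% coverage. A heavier-tailed distribution, as reported above, would push that fraction much lower.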