Calibration is crucial as large language models (LLMs) are increasingly deployed to convert natural language queries into SQL for commercial databases. In this work, we investigate calibration techniques for assigning confidence to generated SQL queries. We show that a straightforward baseline -- deriving confidence from the model's full-sequence probability -- outperforms recent methods that rely on follow-up prompts for self-checking and confidence verbalization. Our comprehensive evaluation, conducted across two widely used Text-to-SQL benchmarks and multiple LLM architectures, provides insights into the relative effectiveness of these calibration strategies.
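As a minimal sketch of the sequence-probability baseline: the confidence score is the product of per-token probabilities the model assigns to the generated SQL, conditioned on the prompt. The snippet below assumes a Hugging Face causal LM; the checkpoint (gpt2), the prompt format, and the choice of raw product over a length-normalized variant are illustrative assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper does not specify this model.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sequence_confidence(prompt: str, sql: str) -> float:
    """Confidence for a generated SQL string: the model's full-sequence
    probability of the SQL tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    sql_ids = tokenizer(sql, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, sql_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)

    # Log-probability of each SQL token, conditioned on the prompt and
    # all preceding SQL tokens. The logit at position t predicts token t+1,
    # so SQL tokens are scored by logits at positions n_prompt-1 .. -2.
    log_probs = torch.log_softmax(logits, dim=-1)
    n_prompt = prompt_ids.shape[1]
    sql_token_logps = (
        log_probs[0, n_prompt - 1 : -1]
        .gather(1, sql_ids[0].unsqueeze(-1))
        .squeeze(-1)
    )

    # Full-sequence probability: exponentiate the summed token log-probs.
    return sql_token_logps.sum().exp().item()


confidence = sequence_confidence(
    "Translate to SQL: How many employees earn over 50k?\nSQL: ",
    "SELECT COUNT(*) FROM employees WHERE salary > 50000;",
)
print(f"sequence probability: {confidence:.3e}")
```

A common design choice here is length normalization (the geometric mean, `sql_token_logps.mean().exp()`), which keeps longer queries from being penalized simply for having more tokens; whether the raw or normalized probability calibrates better is an empirical question.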