This study introduces a new long-form database question answering dataset designed to evaluate how Large Language Models (LLMs) interact with a SQL interpreter. The task requires LLMs to strategically generate multiple SQL queries that retrieve sufficient data from a database, to reason over the acquired context, and to synthesize the results into a comprehensive analytical narrative. Our findings show that this task poses great challenges even for the state-of-the-art GPT-4 model. We propose and evaluate two interaction strategies and provide a fine-grained analysis of the individual stages within the interaction. A key finding is that two primary bottlenecks hinder effective interaction: the capacity for planning and the ability to generate multiple SQL queries. To address the challenge of accurately assessing answer quality, we introduce a multi-agent evaluation framework that simulates the academic peer-review process, enhancing the precision and reliability of our evaluations. This framework allows for a more nuanced understanding of the strengths and limitations of current LLMs in complex retrieval and reasoning tasks.
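To make the task structure concrete, the following is a minimal sketch of a plan, query, and synthesize loop against a SQL interpreter. It is not the interaction strategy evaluated in this study: the `sales` table, the function names, and the fixed query list returned by `plan_queries` (which stands in for an LLM call) are all hypothetical, chosen only to illustrate how several query results can be gathered into context before an answer is written.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("North", 2022, 100.0),
            ("North", 2023, 150.0),
            ("South", 2022, 80.0),
            ("South", 2023, 60.0),
        ],
    )
    return conn


def plan_queries(question):
    """Stand-in for the planning step. A real system would ask an LLM to
    propose these queries; here they are hard-coded (assumption)."""
    return [
        "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region",
        "SELECT year, SUM(revenue) FROM sales GROUP BY year ORDER BY year",
    ]


def run_queries(conn, queries):
    """Execute each query and collect (sql, rows) pairs as context.
    Failed queries are recorded instead of aborting the whole interaction."""
    context = []
    for sql in queries:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            rows = [("error", str(exc))]
        context.append((sql, rows))
    return context


def synthesize(question, context):
    """Stand-in for the synthesis step: lay out the retrieved evidence.
    A real system would prompt an LLM with this context (assumption)."""
    lines = [f"Question: {question}"]
    for sql, rows in context:
        lines.append(f"{sql} -> {rows}")
    return "\n".join(lines)


if __name__ == "__main__":
    question = "How did revenue develop across regions and years?"
    conn = build_demo_db()
    context = run_queries(conn, plan_queries(question))
    print(synthesize(question, context))
```

Separating planning, execution, and synthesis into distinct functions mirrors the stage-wise view of the interaction described above, so each stage can be inspected or replaced on its own.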