Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of data analysis, particularly where data-driven thinking is required, remain uncertain. To bridge this gap, we introduce BIBench, a comprehensive benchmark designed to evaluate the data analysis capabilities of LLMs within the context of Business Intelligence (BI). BIBench assesses LLMs across three dimensions: 1) BI foundational knowledge, evaluating the models' numerical reasoning and familiarity with financial concepts; 2) BI knowledge application, determining the models' ability to quickly comprehend textual information and generate analysis questions from multiple perspectives; and 3) BI technical skills, examining the models' use of technical knowledge to address real-world data analysis challenges. BIBench comprises 11 sub-tasks spanning three task types: classification, extraction, and generation. Additionally, we have developed BIChat, a domain-specific dataset with over a million data points, for fine-tuning LLMs. We will release BIBench, BIChat, and the evaluation scripts at \url{https://github.com/cubenlp/BIBench}. This benchmark aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of data analysis.