Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. However, their proficiency and reliability in the specialized domain of data analysis, particularly where data-driven thinking is required, remain uncertain. To bridge this gap, we introduce BIBench, a comprehensive benchmark designed to evaluate the data analysis capabilities of LLMs in the context of Business Intelligence (BI). BIBench assesses LLMs along three dimensions: 1) BI foundational knowledge, which evaluates the models' numerical reasoning and familiarity with financial concepts; 2) BI knowledge application, which measures the models' ability to quickly comprehend textual information and generate analysis questions from multiple perspectives; and 3) BI technical skills, which examines how the models use technical knowledge to address real-world data analysis challenges. BIBench comprises 11 sub-tasks spanning three task types: classification, extraction, and generation. Additionally, we have developed BIChat, a domain-specific dataset with over a million data points, for fine-tuning LLMs. We will release BIBenchmark, BIChat, and the evaluation scripts at \url{https://github.com/cubenlp/BIBench}. This benchmark aims to provide a measure for in-depth analysis of LLM abilities and to foster the advancement of LLMs in the field of data analysis.