Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench: a meticulously crafted and, to date, the most comprehensive evaluation benchmark for assessing the financial knowledge of LLMs in a Chinese context. In practice, to better align with the career trajectory of Chinese financial practitioners, we build a systematic evaluation around 4 first-level categories: (1) Financial Subject: whether LLMs can memorize the necessary basic knowledge of financial subjects, such as economics, statistics and auditing. (2) Financial Qualification: whether LLMs can obtain the required financial certifications, such as certified public accountant, securities qualification and banking qualification. (3) Financial Practice: whether LLMs can fulfill practical financial jobs, such as tax consultant, junior accountant and securities analyst. (4) Financial Law: whether LLMs can meet the requirements of financial laws and regulations, such as tax law, insurance law and economic law. CFinBench comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. We conduct extensive experiments on 50 representative LLMs of various model sizes on CFinBench. The results show that GPT-4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%, highlighting the challenge posed by CFinBench. The dataset and evaluation code are available at https://cfinbench.github.io/.
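To make the evaluation protocol concrete, the scoring over the three question types (single-choice, multiple-choice, judgment) can be sketched as below. This is a minimal illustrative sketch, not the benchmark's actual evaluation code; the `Question` schema, field names, and exact-match policy for multiple-choice questions are assumptions introduced here for illustration.

```python
# Minimal sketch of scoring a CFinBench-style evaluation.
# Assumption: each question carries a gold answer set; a prediction is
# correct only if it matches the gold set exactly (so a multiple-choice
# answer must include every correct option and no extra ones).
from dataclasses import dataclass

@dataclass
class Question:
    category: str        # one of the 4 first-level categories (illustrative)
    qtype: str           # "single", "multiple", or "judgment"
    answer: frozenset    # gold answer option(s), e.g. {"A"} or {"A", "C"}

def accuracy(questions, predictions):
    """Exact-match accuracy over paired questions and predicted answer sets."""
    correct = sum(
        q.answer == frozenset(p) for q, p in zip(questions, predictions)
    )
    return correct / len(questions)

# Tiny worked example with one question of each type.
qs = [
    Question("Financial Subject", "single", frozenset({"A"})),
    Question("Financial Law", "multiple", frozenset({"A", "C"})),
    Question("Financial Practice", "judgment", frozenset({"T"})),
]
preds = [{"A"}, {"A", "C"}, {"F"}]
print(accuracy(qs, preds))  # 2 of 3 exact matches
```

Under this exact-match convention, a partially correct multiple-choice answer scores zero, which is one common (and strict) way to grade such items.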