Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench: a meticulously crafted and, to date, the most comprehensive evaluation benchmark for assessing the financial knowledge of LLMs in a Chinese context. In practice, to better align with the career trajectory of Chinese financial practitioners, we build a systematic evaluation around 4 first-level categories: (1) Financial Subject: whether LLMs can memorize the necessary basic knowledge of financial subjects, such as economics, statistics and auditing. (2) Financial Qualification: whether LLMs can obtain the required financial certifications, such as certified public accountant, securities qualification and banking qualification. (3) Financial Practice: whether LLMs can fulfill practical financial jobs, such as tax consultant, junior accountant and securities analyst. (4) Financial Law: whether LLMs can meet the requirements of financial laws and regulations, such as tax law, insurance law and economic law. CFinBench comprises 99,100 questions spanning 43 second-level categories with 3 question types: single-choice, multiple-choice and judgment. We conduct extensive experiments on 50 representative LLMs of various model sizes on CFinBench. The results show that GPT-4 and some Chinese-oriented models lead the benchmark, with the highest average accuracy being 60.16%, highlighting the challenge posed by CFinBench. The dataset and evaluation code are available at https://cfinbench.github.io/.
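To make the evaluation protocol concrete, the scoring over the three question types (single-choice, multiple-choice, judgment) can be sketched as below. This is a minimal illustrative sketch, not the benchmark's actual evaluation code; the `Question` schema, field names, and exact-match policy for multiple-choice questions are assumptions introduced here for illustration.

```python
# Minimal sketch of scoring a CFinBench-style evaluation.
# Assumption: each question carries a gold answer set; a prediction is
# correct only if it matches the gold set exactly (so a multiple-choice
# answer must include every correct option and no extra ones).
from dataclasses import dataclass

@dataclass
class Question:
    category: str        # one of the 4 first-level categories (illustrative)
    qtype: str           # "single", "multiple", or "judgment"
    answer: frozenset    # gold answer option(s), e.g. {"A"} or {"A", "C"}

def accuracy(questions, predictions):
    """Exact-match accuracy over paired questions and predicted answer sets."""
    correct = sum(
        q.answer == frozenset(p) for q, p in zip(questions, predictions)
    )
    return correct / len(questions)

# Tiny worked example with one question of each type.
qs = [
    Question("Financial Subject", "single", frozenset({"A"})),
    Question("Financial Law", "multiple", frozenset({"A", "C"})),
    Question("Financial Practice", "judgment", frozenset({"T"})),
]
preds = [{"A"}, {"A", "C"}, {"F"}]
print(accuracy(qs, preds))  # 2 of 3 exact matches
```

Under this exact-match convention, a partially correct multiple-choice answer scores zero, which is one common (and strict) way to grade such items.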