Large language models (LLMs) have demonstrated exceptional performance on a variety of natural language processing tasks, yet their efficacy on more challenging, domain-specific tasks remains largely unexplored. This paper presents FinEval, a benchmark specifically designed to evaluate the financial domain knowledge of LLMs. FinEval is a collection of high-quality multiple-choice questions covering Finance, Economy, Accounting, and Certificate. It includes 4,661 questions spanning 34 different academic subjects. To ensure a comprehensive evaluation of model performance, FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts. We evaluate state-of-the-art Chinese and English LLMs on FinEval; the results show that only GPT-4 achieves an accuracy close to 70% across the different prompt settings, indicating significant room for improvement in the financial domain knowledge of LLMs. Our work offers a more comprehensive benchmark for evaluating financial knowledge, drawing on mock exam data and covering a wide range of evaluated LLMs.
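The four prompt settings named in the abstract (zero-shot or few-shot, crossed with answer-only or chain-of-thought) can be illustrated with a minimal sketch. The `Question` fields, the `build_prompt` and `accuracy` helpers, and the English template wording are all hypothetical choices made for illustration; they are not taken from FinEval's released prompts or code.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Question:
    """A hypothetical multiple-choice item; field names are assumptions."""
    stem: str
    choices: Dict[str, str]   # option label -> option text
    answer: str               # gold option label, e.g. "B"
    explanation: str = ""     # worked reasoning, used only in few-shot CoT demos


def format_question(q: Question) -> str:
    """Render the question stem followed by one line per option."""
    lines = [q.stem]
    lines += [f"{label}. {text}" for label, text in q.choices.items()]
    return "\n".join(lines)


def build_prompt(target: Question,
                 demos: Sequence[Question] = (),
                 chain_of_thought: bool = False) -> str:
    """Build one prompt (assumed template, not FinEval's own).

    An empty `demos` gives a zero-shot prompt; a non-empty one gives few-shot.
    `chain_of_thought` switches between answer-only and reasoning-first output.
    """
    parts: List[str] = []
    for d in demos:
        block = format_question(d)
        if chain_of_thought:
            block += f"\nReasoning: {d.explanation}"
        block += f"\nAnswer: {d.answer}"
        parts.append(block)
    # The target question ends with an open cue for the model to complete.
    cue = "\nReasoning:" if chain_of_thought else "\nAnswer:"
    parts.append(format_question(target) + cue)
    return "\n\n".join(parts)


def accuracy(predictions: Sequence[str], questions: Sequence[Question]) -> float:
    """Fraction of predicted option labels that match the gold labels."""
    if not questions:
        return 0.0
    correct = sum(p == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)
```

Under these assumptions, calling `build_prompt(q)` yields a zero-shot answer-only prompt, while `build_prompt(q, demos=examples, chain_of_thought=True)` yields a few-shot chain-of-thought prompt; `accuracy` would then be computed separately for each setting.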