FinBen: A Holistic Financial Benchmark for Large Language Models

Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang

From arXiv; 26 pages, 11 figures

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance remains underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, comprising 36 datasets that span 24 financial tasks and cover seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: while LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks such as text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and code are released for the research community: https://github.com/The-FinAI/PIXIU.
