Large language models (LLMs) have transformed natural language processing (NLP) and shown promise in various fields, yet their potential in finance remains underexplored because of a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, comprising 36 datasets that span 24 financial tasks and cover seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings. While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and with complex tasks such as text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first shared task on financial LLMs at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and code are released for the research community: https://github.com/The-FinAI/PIXIU.