Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing finance benchmarks often suffer from limited language and task coverage, as well as challenges such as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets from both Chinese and English across eight core financial NLP tasks. Developed from extensive open source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o Llama3, FinGPT and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks.This research not only provides the financial large language models with a practical evaluation tool but also guides the development and optimization of future research. The source code for Golden Touchstone and model weight of Touchstone-GPT have been made publicly available at \url{https://github.com/IDEA-FinAI/Golden-Touchstone}, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.

翻译：随着大语言模型在金融领域的日益普及，迫切需要一种标准化方法来全面评估其性能。然而，现有的金融基准测试往往存在语言和任务覆盖范围有限、数据集质量不高以及大语言模型评估适应性不足等挑战。为应对这些局限，我们提出了"金试金石"，这是首个面向金融大语言模型的综合性双语基准，它整合了中英文两种语言在八项核心金融自然语言处理任务中的代表性数据集。该基准基于广泛的开源数据收集和行业特定需求开发，包含多种金融任务，旨在全面评估模型的语言理解和生成能力。通过对GPT-4o、Llama3、FinGPT和FinMA等主流模型在该基准上的对比分析，我们揭示了它们在处理复杂金融信息方面的优势与局限。此外，我们开源了Touchstone-GPT——一个通过持续预训练和金融指令微调训练的金融大语言模型，该模型在双语基准上表现出色，但在特定任务中仍存在不足。本研究不仅为金融大语言模型提供了实用的评估工具，也为未来研究的开发和优化提供了指导。"金试金石"的源代码及Touchstone-GPT的模型权重已在\url{https://github.com/IDEA-FinAI/Golden-Touchstone}公开，以促进金融大语言模型的持续演进，并推动这一关键领域的进一步研究。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日