As large language models (LLMs) become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and poor adaptability to LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative Chinese and English datasets across eight core financial NLP tasks. Built from extensive open-source data collection and industry-specific demands, the benchmark includes a variety of financial tasks designed to thoroughly assess models' language understanding and generation capabilities. Through a comparative analysis of major models on the benchmark, such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations on specific tasks. This research not only provides financial LLMs with a practical evaluation tool but also helps guide their future development and optimization. The source code for Golden Touchstone and the model weights of Touchstone-GPT are publicly available at \url{https://github.com/IDEA-FinAI/Golden-Touchstone}, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.