As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and poor adaptability to LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Built from extensive open-source data collection and industry-specific demands, the benchmark thoroughly assesses models' language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations on specific tasks. This research provides a practical evaluation tool for financial LLMs and guides their future development and optimization. The source code for Golden Touchstone and the model weights of Touchstone-GPT are publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.