Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models - 专知论文

会员服务 ·

0

金融 · 基准 · 语言模型 · 金融大语言模型 · 大语言模型 ·

2025 年 12 月 8 日

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

翻译：黄金试金石：用于评估金融大语言模型的全面双语基准

Xiaojun Wu,Junxi Liu,Huanyi Su,Zhouchi Lin,Yiyan Qi,Chengjin Xu,Jiajun Su,Jiajie Zhong,Fuwei Wang,Saizhuo Wang,Fengrui Hua,Jia Li,Jian Guo

from arxiv, Published in Findings of EMNLP 2025

As large language models (LLMs) increasingly permeate the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. Existing financial benchmarks often suffer from limited language and task coverage, low-quality datasets, and inadequate adaptability for LLM evaluation. To address these limitations, we introduce Golden Touchstone, a comprehensive bilingual benchmark for financial LLMs, encompassing eight core financial NLP tasks in both Chinese and English. Developed from extensive open-source data collection and industry-specific demands, this benchmark thoroughly assesses models' language understanding and generation capabilities. Through comparative analysis of major models such as GPT-4o, Llama3, FinGPT, and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-source Touchstone-GPT, a financial LLM trained through continual pre-training and instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks. This research provides a practical evaluation tool for financial LLMs and guides future development and optimization. The source code for Golden Touchstone and model weight of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone.

翻译：随着大语言模型（LLMs）日益渗透金融领域，迫切需要一种标准化方法来全面评估其性能。现有的金融基准常存在语言和任务覆盖范围有限、数据集质量低以及不适应LLM评估等问题。为应对这些局限，我们推出了黄金试金石，一个全面的金融LLM双语基准，涵盖中英文八项核心金融自然语言处理任务。该基准基于广泛的开源数据收集和行业特定需求开发，全面评估模型的语言理解与生成能力。通过对GPT-4o、Llama3、FinGPT和FinMA等主流模型的对比分析，揭示了它们在处理复杂金融信息时的优势与不足。此外，我们开源了Touchstone-GPT——一个通过持续预训练和指令微调训练的金融LLM，该模型在双语基准上表现出色，但在特定任务中仍存在局限。本研究为金融LLMs提供了实用的评估工具，并指导未来的开发与优化。黄金试金石的源代码及Touchstone-GPT的模型权重已公开于https://github.com/IDEA-FinAI/Golden-Touchstone。

0

相关内容

在社会经济生活，银行、证券或保险业者从市场主体募集资金，并投资给其它市场主体的经济活动。

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

专知会员服务

195+阅读 · 2020年5月31日

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

专知会员服务

36+阅读 · 2020年4月14日

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

专知会员服务

27+阅读 · 2020年4月5日

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

专知会员服务

19+阅读 · 2020年2月26日

【《Scikit-Learn、Keras与TensorFlow机器学习实用指南(第二版)》电子书与代码(Notebooks)】Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition

【《Scikit-Learn、Keras与TensorFlow机器学习实用指南(第二版)》电子书与代码(Notebooks)】Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition

专知会员服务

219+阅读 · 2019年12月18日

【ACL 2019 Tutorials】深度贝叶斯自然语言处理（Deep Bayesian Natural Language Processing），Jen-Tzung Chien

【ACL 2019 Tutorials】深度贝叶斯自然语言处理（Deep Bayesian Natural Language Processing），Jen-Tzung Chien

专知会员服务

48+阅读 · 2019年11月17日

【ACL 2019 Tutorials】政治文本的计算性分析：沟通不同领域的研究成果（Computational Analysis of Political Texts: Bridging Research Efforts Across Communities），GoranGlavaš,Federico Nanni,Simone Paolo Ponzetto

【ACL 2019 Tutorials】政治文本的计算性分析：沟通不同领域的研究成果（Computational Analysis of Political Texts: Bridging Research Efforts Across Communities），GoranGlavaš,Federico Nanni,Simone Paolo Ponzetto

专知会员服务

10+阅读 · 2019年11月17日

【自然语言处理快速入门】《Natural Language Processing: A Crash Course!》by Shantanu Phadke

【自然语言处理快速入门】《Natural Language Processing: A Crash Course!》by Shantanu Phadke

专知会员服务

38+阅读 · 2019年11月2日

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

专知会员服务

11+阅读 · 2019年11月2日

【翻译技术速递】入门教程：Trados 翻译记忆库工具

【翻译技术速递】入门教程：Trados 翻译记忆库工具

翻译技术沙龙

38+阅读 · 2019年11月28日

预知未来——Gluon 时间序列工具包（GluonTS）

预知未来——Gluon 时间序列工具包（GluonTS）

ApacheMXNet

24+阅读 · 2019年6月25日

MIT高赞深度学习教程：一文看懂CNN、RNN等7种范例（TensorFlow教程）

MIT高赞深度学习教程：一文看懂CNN、RNN等7种范例（TensorFlow教程）

全球人工智能

10+阅读 · 2019年5月5日

TensorFlow 2.0官方Transformer教程 (Attention is All you Need)

TensorFlow 2.0官方Transformer教程 (Attention is All you Need)

专知

54+阅读 · 2019年4月12日

Auto-Keras与AutoML：入门指南

Auto-Keras与AutoML：入门指南

云栖社区

18+阅读 · 2019年2月9日

TensorFlow 2.0深度强化学习指南

TensorFlow 2.0深度强化学习指南

云栖社区

18+阅读 · 2019年2月1日

NLP-Progress记录NLP最新数据集、论文和代码: 助你紧跟NLP前沿

NLP-Progress记录NLP最新数据集、论文和代码: 助你紧跟NLP前沿

专知

17+阅读 · 2018年11月15日

读论文Discriminative Deep Metric Learning for Face and KV

读论文Discriminative Deep Metric Learning for Face and KV

统计学习与视觉计算组

12+阅读 · 2018年4月6日

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器学习研究会

12+阅读 · 2017年12月24日

自然语言处理（二）机器翻译篇 (NLP: machine translation)

自然语言处理（二）机器翻译篇 (NLP: machine translation)

DeepLearning中文论坛

12+阅读 · 2015年7月1日

汉英篇章衔接对齐资源构建与分析研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于形态和多词的有限语料蒙汉互译调序优化方法

国家自然科学基金

0+阅读 · 2015年12月31日

基于犹豫模糊语言信息的定性决策理论与方法

国家自然科学基金

2+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

47+阅读 · 2015年12月31日

面向汉语-泰语跨语言新闻事件检索方法研究

国家自然科学基金

2+阅读 · 2014年12月31日

Banach空间的嵌入理论及其应用

国家自然科学基金

1+阅读 · 2014年12月31日

柬埔寨语命名实体识别及汉柬双语可比语料库构建方法研究

国家自然科学基金

0+阅读 · 2014年12月31日

波动率微笑：隐含信息与动态建模

国家自然科学基金

2+阅读 · 2014年12月31日

面向汉语文本理解的语义计算方法

国家自然科学基金

8+阅读 · 2014年12月31日

基于组合Hodge理论的图像视频质量评价方法

国家自然科学基金

0+阅读 · 2014年12月31日

DeepSeek-V3 Technical Report

Arxiv

18+阅读 · 2024年12月27日

Is ChatGPT a Good Recommender? A Preliminary Study

Arxiv

176+阅读 · 2023年4月20日

A Survey on Graph Diffusion Models: Generative AI in Science for Molecule, Protein and Material

Arxiv

88+阅读 · 2023年4月4日

A Survey of Large Language Models

A Survey of Large Language Models

Arxiv

501+阅读 · 2023年3月31日

Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services

Arxiv

156+阅读 · 2023年3月29日

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Arxiv

64+阅读 · 2023年3月29日

Nature Language Reasoning, A Survey

Arxiv

83+阅读 · 2023年3月26日

Knowledge Graphs: Opportunities and Challenges

Arxiv

182+阅读 · 2023年3月24日

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Arxiv

51+阅读 · 2023年3月22日

A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?

Arxiv

88+阅读 · 2023年3月21日

VIP会员

文章信息

相关主题

金融大语言模型

大语言模型

最新内容

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

专知会员服务

2+阅读 · 今天11:43

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

专知会员服务

2+阅读 · 今天11:41

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

专知会员服务

5+阅读 · 今天6:30

网状网络及其在军事领域的运用

网状网络及其在军事领域的运用

专知会员服务

5+阅读 · 今天6:18

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

专知会员服务

6+阅读 · 今天6:08

无美国参与的欧洲战争方式（万字长文）

无美国参与的欧洲战争方式（万字长文）

专知会员服务

6+阅读 · 今天5:54

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

专知会员服务

7+阅读 · 今天5:22

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

专知会员服务

7+阅读 · 今天5:15

《国防领域敏感性分析白皮书》

《国防领域敏感性分析白皮书》

专知会员服务

7+阅读 · 今天3:42

综述 | 从问答到任务完成：Agent系统与Harness设计

综述 | 从问答到任务完成：Agent系统与Harness设计

专知会员服务

5+阅读 · 6月24日

Agentic RL：框架、实践与长程智能体训练

Agentic RL：框架、实践与长程智能体训练

专知会员服务

7+阅读 · 6月24日

反无人机拦截器训练与运用课程：对美国陆军部队发展的启示

反无人机拦截器训练与运用课程：对美国陆军部队发展的启示

专知会员服务

10+阅读 · 6月24日

重新思考无人机时代的生存能力

重新思考无人机时代的生存能力

专知会员服务

9+阅读 · 6月24日

装甲突击旅：现代战争思考、战斗与组织

装甲突击旅：现代战争思考、战斗与组织

专知会员服务

7+阅读 · 6月24日

在人工智能加速决策环境中拓展OODA循环

在人工智能加速决策环境中拓展OODA循环

专知会员服务

9+阅读 · 6月24日

相关VIP内容

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

专知会员服务

195+阅读 · 2020年5月31日

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

【微软-ACL2020】TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER

专知会员服务

36+阅读 · 2020年4月14日

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

专知会员服务

27+阅读 · 2020年4月5日

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

【Google Research】Wavesplit:通过说话者聚类实现端到端的语音分离，Wavesplit: End-to-End Speech Separation by Speaker Clustering

专知会员服务

19+阅读 · 2020年2月26日

【《Scikit-Learn、Keras与TensorFlow机器学习实用指南(第二版)》电子书与代码(Notebooks)】Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition

【《Scikit-Learn、Keras与TensorFlow机器学习实用指南(第二版)》电子书与代码(Notebooks)】Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition

专知会员服务

219+阅读 · 2019年12月18日

【ACL 2019 Tutorials】深度贝叶斯自然语言处理（Deep Bayesian Natural Language Processing），Jen-Tzung Chien

【ACL 2019 Tutorials】深度贝叶斯自然语言处理（Deep Bayesian Natural Language Processing），Jen-Tzung Chien

专知会员服务

48+阅读 · 2019年11月17日

【ACL 2019 Tutorials】政治文本的计算性分析：沟通不同领域的研究成果（Computational Analysis of Political Texts: Bridging Research Efforts Across Communities），GoranGlavaš,Federico Nanni,Simone Paolo Ponzetto

【ACL 2019 Tutorials】政治文本的计算性分析：沟通不同领域的研究成果（Computational Analysis of Political Texts: Bridging Research Efforts Across Communities），GoranGlavaš,Federico Nanni,Simone Paolo Ponzetto

专知会员服务

10+阅读 · 2019年11月17日

【自然语言处理快速入门】《Natural Language Processing: A Crash Course!》by Shantanu Phadke

【自然语言处理快速入门】《Natural Language Processing: A Crash Course!》by Shantanu Phadke

专知会员服务

38+阅读 · 2019年11月2日

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

专知会员服务

11+阅读 · 2019年11月2日

热门VIP内容

开通专知VIP会员享更多权益服务

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

网状网络及其在军事领域的运用

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

相关资讯

【翻译技术速递】入门教程：Trados 翻译记忆库工具

【翻译技术速递】入门教程：Trados 翻译记忆库工具

翻译技术沙龙

38+阅读 · 2019年11月28日

预知未来——Gluon 时间序列工具包（GluonTS）

预知未来——Gluon 时间序列工具包（GluonTS）

ApacheMXNet

24+阅读 · 2019年6月25日

MIT高赞深度学习教程：一文看懂CNN、RNN等7种范例（TensorFlow教程）

MIT高赞深度学习教程：一文看懂CNN、RNN等7种范例（TensorFlow教程）

全球人工智能

10+阅读 · 2019年5月5日

TensorFlow 2.0官方Transformer教程 (Attention is All you Need)

TensorFlow 2.0官方Transformer教程 (Attention is All you Need)

专知

54+阅读 · 2019年4月12日

Auto-Keras与AutoML：入门指南

Auto-Keras与AutoML：入门指南

云栖社区

18+阅读 · 2019年2月9日

TensorFlow 2.0深度强化学习指南

TensorFlow 2.0深度强化学习指南

云栖社区

18+阅读 · 2019年2月1日

NLP-Progress记录NLP最新数据集、论文和代码: 助你紧跟NLP前沿

NLP-Progress记录NLP最新数据集、论文和代码: 助你紧跟NLP前沿

专知

17+阅读 · 2018年11月15日

读论文Discriminative Deep Metric Learning for Face and KV

读论文Discriminative Deep Metric Learning for Face and KV

统计学习与视觉计算组

12+阅读 · 2018年4月6日

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器翻译新时代：Facebook 开源无监督机器翻译模型和大规模训练语料

机器学习研究会

12+阅读 · 2017年12月24日

自然语言处理（二）机器翻译篇 (NLP: machine translation)

自然语言处理（二）机器翻译篇 (NLP: machine translation)

DeepLearning中文论坛

12+阅读 · 2015年7月1日

相关论文

DeepSeek-V3 Technical Report

Arxiv

18+阅读 · 2024年12月27日

Is ChatGPT a Good Recommender? A Preliminary Study

Arxiv

176+阅读 · 2023年4月20日

A Survey on Graph Diffusion Models: Generative AI in Science for Molecule, Protein and Material

Arxiv

88+阅读 · 2023年4月4日

A Survey of Large Language Models

A Survey of Large Language Models

Arxiv

501+阅读 · 2023年3月31日

Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services

Arxiv

156+阅读 · 2023年3月29日

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Arxiv

64+阅读 · 2023年3月29日

Nature Language Reasoning, A Survey

Arxiv

83+阅读 · 2023年3月26日

Knowledge Graphs: Opportunities and Challenges

Arxiv

182+阅读 · 2023年3月24日

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Arxiv

51+阅读 · 2023年3月22日

A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to GPT-5 All You Need?

Arxiv

88+阅读 · 2023年3月21日

相关基金

汉英篇章衔接对齐资源构建与分析研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于形态和多词的有限语料蒙汉互译调序优化方法

国家自然科学基金

0+阅读 · 2015年12月31日

基于犹豫模糊语言信息的定性决策理论与方法

国家自然科学基金

2+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

47+阅读 · 2015年12月31日

面向汉语-泰语跨语言新闻事件检索方法研究

国家自然科学基金

2+阅读 · 2014年12月31日

Banach空间的嵌入理论及其应用

国家自然科学基金

1+阅读 · 2014年12月31日

柬埔寨语命名实体识别及汉柬双语可比语料库构建方法研究

国家自然科学基金

0+阅读 · 2014年12月31日

波动率微笑：隐含信息与动态建模

国家自然科学基金

2+阅读 · 2014年12月31日

面向汉语文本理解的语义计算方法

国家自然科学基金

8+阅读 · 2014年12月31日

基于组合Hodge理论的图像视频质量评价方法

国家自然科学基金

0+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员