In recent years, Large Language Models (LLMs) have demonstrated remarkable versatility across applications ranging from natural language understanding to domain-specific knowledge tasks. However, applying LLMs to complex, high-stakes domains like finance requires rigorous evaluation to ensure reliability, accuracy, and compliance with industry standards. To address this need, we conduct a comprehensive comparative study of three state-of-the-art LLMs, GLM-4, Mistral-NeMo, and LLaMA3.1, focusing on their effectiveness in generating automated financial reports. Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information. By examining each model's capabilities, we aim to provide an insightful assessment of their strengths and limitations. Our paper offers benchmarks for financial report analysis built on metrics such as ROUGE-1, BERTScore, and LLM Score. We introduce an evaluation framework that integrates quantitative metrics (e.g., precision, recall) with qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality. Additionally, we make our financial dataset publicly available, inviting researchers and practitioners to leverage, scrutinize, and extend our findings through broader community engagement and collaborative improvement. Our dataset is available on Hugging Face.
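For concreteness, ROUGE-1 scores a generated report against a reference by unigram overlap. The sketch below is a minimal stdlib-only illustration of that computation; the whitespace tokenization and the lack of stemming are simplifying assumptions, not the paper's exact evaluation pipeline.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 via unigram overlap.

    Precision = overlap / candidate unigrams,
    Recall    = overlap / reference unigrams,
    F1        = harmonic mean of the two.
    Tokenization here is plain lowercase whitespace splitting (an
    assumption; real evaluations often stem and strip punctuation).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    "revenue increased by 12 percent year over year",
    "revenue grew 12 percent year over year",
)
# 6 overlapping unigrams out of 8 candidate / 7 reference tokens
```

BERTScore replaces this exact-match overlap with contextual-embedding similarity, and LLM Score delegates the judgment to a grader model; both trade the transparency of counting for sensitivity to paraphrase.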