In recent years, Large Language Models (LLMs) have demonstrated remarkable versatility across applications ranging from natural language understanding to domain-specific knowledge tasks. However, applying LLMs to complex, high-stakes domains like finance requires rigorous evaluation to ensure reliability, accuracy, and compliance with industry standards. To address this need, we conduct a comprehensive comparative study of three state-of-the-art LLMs, GLM-4, Mistral-NeMo, and LLaMA3.1, focusing on their effectiveness in generating automated financial reports. Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information. By examining each model's capabilities, we aim to provide an insightful assessment of their strengths and limitations. Our paper offers benchmarks for financial report analysis built on metrics such as ROUGE-1, BERTScore, and LLM Score. We introduce an evaluation framework that integrates quantitative metrics (e.g., precision, recall) with qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality. Additionally, we make our financial dataset publicly available, inviting researchers and practitioners to leverage, scrutinize, and extend our findings through broader community engagement and collaborative improvement. Our dataset is available on Hugging Face.
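For concreteness, ROUGE-1 scores a generated report against a reference by unigram overlap. The sketch below is a minimal stdlib-only illustration of that computation; the whitespace tokenization and the lack of stemming are simplifying assumptions, not the paper's exact evaluation pipeline.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 via unigram overlap.

    Precision = overlap / candidate unigrams,
    Recall    = overlap / reference unigrams,
    F1        = harmonic mean of the two.
    Tokenization here is plain lowercase whitespace splitting (an
    assumption; real evaluations often stem and strip punctuation).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    "revenue increased by 12 percent year over year",
    "revenue grew 12 percent year over year",
)
# 6 overlapping unigrams out of 8 candidate / 7 reference tokens
```

BERTScore replaces this exact-match overlap with contextual-embedding similarity, and LLM Score delegates the judgment to a grader model; both trade the transparency of counting for sensitivity to paraphrase.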