We introduce SECQUE, a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks. SECQUE comprises 565 expert-written questions covering SEC filings analysis across four key categories: comparison analysis, ratio calculation, risk assessment, and financial insight generation. To assess model performance, we develop SECQUE-Judge, an evaluation mechanism leveraging multiple LLM-based judges, which demonstrates strong alignment with human evaluations. Additionally, we provide an extensive analysis of various models' performance on our benchmark. By making SECQUE publicly available, we aim to facilitate further research and advancements in financial AI.