Over the past three years, the financial services industry has seen Large Language Models (LLMs) and agents move from the exploration stage to the readiness and governance stages. Financial large language models (FinLLMs), such as the open-source FinGPT and the proprietary BloombergGPT, have great potential in financial applications, including retrieving real-time data, tutoring, analyzing social media sentiment, analyzing SEC filings, and agentic trading. However, general-purpose LLMs and agents lack financial expertise and often struggle with complex financial reasoning. This paper presents an evaluation and benchmarking suite that covers the lifecycle of FinLLMs and FinAgents. The suite, led by SecureFinAI Lab, includes an evaluation pipeline and a governance framework developed in collaboration with the Linux Foundation and PyTorch Foundation, a FinLLM Leaderboard with HuggingFace, an AgentOps framework with Red Hat, and a documentation website with the Rensselaer Center for Open Source. Our collaborative development has evolved through three stages: FinLLM Exploration (2023), FinLLM Readiness (2024), and FinAI Governance (2025). The proposed suite serves as an open platform that enables researchers and practitioners to perform both quantitative and qualitative analysis of different FinLLMs and FinAgents, fostering a more robust and reliable FinAI ecosystem.