Recently, large language models (LLMs) have achieved strong performance on static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiation with asset auctions. Our experiments with nine advanced LLMs reveal that, although the models exhibit basic long-term planning and investment logic, they fail to leverage complex interactions for profit, and their strong static reasoning performance does not translate into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, which leaves them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can serve as a valuable reference for building more intelligent LLM-based decision-making systems.