Evaluating the decision-making capabilities of large language models (LLMs) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings. These evaluations overlook the defining challenge of real executive decision-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies. We introduce \textsc{CEO-Bench}, a multi-agent benchmark that evaluates LLMs on CEO-level strategic resource reallocation -- the process of redirecting capital across business units in a multi-round, constraint-rich organizational environment. In \textsc{CEO-Bench}, LLM agents receive conflicting advice from four role-conditioned C-suite advisors (CFO, CTO, COO, CMO), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history-sensitive judgment, and plan validity. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration -- the hardest capability layer. We identify systematic failure modes including single-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration-boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action. These findings delineate the current capability boundary of LLMs as organizational decision-makers and inform the design of future AI-assisted executive systems.