The rapid advancement of Large Language Models (LLMs) has sparked significant interest in their potential to augment or automate managerial functions. One of the most recent trends in AI benchmarking is evaluating LLM performance over longer time horizons. While LLMs excel at tasks involving natural language and pattern recognition, their capabilities in multi-step, strategic business decision-making remain largely unexplored. A few studies, most notably Vending-Bench, have shown that results over extended horizons can diverge from those on short-term task benchmarks, and alternative benchmarks for long-term coherence remain scarce. This research introduces and analyses a novel benchmark for business decision-making based on a business simulation game. It contributes to the recent AI literature by offering the research community a reproducible, open-access management simulator for LLM benchmarking. The framework is used to evaluate five leading LLMs available through free online interfaces: Gemini, ChatGPT, Meta AI, Mistral AI, and Grok. Each LLM makes decisions for a simulated retail company. The experimental environment is a dynamic, month-by-month management simulation implemented transparently as a spreadsheet model. In each of twelve months, the LLM receives a structured prompt containing the full business report from the previous period and is tasked with making key strategic decisions: pricing, order size, marketing budget, hiring, dismissals, loans, training expense, R&D expense, a sales forecast, and an income forecast. The methodology compares the LLMs on quantitative metrics (profit, revenue, market share, and other KPIs) and analyses their decisions for strategic coherence, adaptability to market changes, and the rationale they provide. This approach moves beyond simple performance metrics to an assessment of long-term decision-making.
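For concreteness, the monthly decision loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the `Decisions` schema, `build_prompt`, `run_benchmark`, and the `simulator` and `llm_decide` interfaces are all hypothetical names introduced here; only the twelve-month horizon and the ten decision variables come from the abstract.

```python
from dataclasses import dataclass, asdict

@dataclass
class Decisions:
    # The ten monthly decisions named in the abstract; field names are illustrative.
    price: float
    order_size: int
    marketing_budget: float
    hires: int
    dismissals: int
    loan: float
    training_expense: float
    rnd_expense: float
    sales_forecast: int
    income_forecast: float

def build_prompt(report: dict) -> str:
    # Structured prompt carrying the full business report from the previous period.
    lines = [f"{key}: {value}" for key, value in report.items()]
    return "Previous month's report:\n" + "\n".join(lines) + "\nReturn your ten decisions."

def run_benchmark(simulator, llm_decide, months: int = 12) -> list:
    # One episode: twelve monthly turns, each conditioned on the prior period's report.
    # `simulator` is assumed to expose initial_report() and step(); `llm_decide` is
    # assumed to return a parsed Decisions instance for a given prompt.
    report = simulator.initial_report()
    history = []
    for month in range(1, months + 1):
        decisions = llm_decide(build_prompt(report))
        report = simulator.step(decisions)  # the spreadsheet model advances one month
        history.append({"month": month, **asdict(decisions), **report})
    return history  # later scored on profit, revenue, market share, and other KPIs
```

The returned history would then support both the quantitative comparison (KPIs per month) and the qualitative analysis of each model's rationale across turns.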