As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll make the consequences of poor decisions compound. We evaluate 12 models, both proprietary and open-source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K: Claude Opus 4.6 achieves the highest average final funds at \$1.27M, followed by GLM-5 at \$1.21M with 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and failure to detect adversarial clients is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail in distinct ways, such as over-parallelization, exposing capability gaps in long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.