We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify the kind of professional scenario it faces before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check, whose mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
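As a rough illustration of the metrics named above, the following Python sketch shows how a conjunctive mandatory gate, routing coverage, and the count of tasks solved by exactly one model could be computed from per-model pass sets. All function names, the data layout, and the choice to score a gated-out response as 0.0 are assumptions made for illustration; they are not taken from the KWBench implementation, and the three-tier rubric weights are deliberately left out because the abstract does not specify them.

```python
from collections import Counter
from typing import Dict, List, Set


def passes_gate(mandatory: List[bool]) -> bool:
    """Conjunctive gate: every mandatory criterion must be satisfied."""
    return all(mandatory)


def gated_score(mandatory: List[bool], quality: float) -> float:
    """Return the quality score only if the gate passes.

    Scoring a failed gate as 0.0 is an assumption of this sketch.
    """
    return quality if passes_gate(mandatory) else 0.0


def routing_coverage(passes_by_model: Dict[str, Set[str]], n_tasks: int) -> float:
    """Fraction of tasks passed by at least one model (ideal routing)."""
    covered = set().union(*passes_by_model.values())
    return len(covered) / n_tasks


def solved_by_exactly_one(passes_by_model: Dict[str, Set[str]]) -> Set[str]:
    """Tasks that appear in exactly one model's pass set."""
    counts = Counter(t for passed in passes_by_model.values() for t in passed)
    return {task for task, c in counts.items() if c == 1}


# Toy data with hypothetical model and task names.
toy_passes = {
    "model_a": {"t1", "t2"},
    "model_b": {"t2", "t3"},
    "model_c": set(),
}
toy_coverage = routing_coverage(toy_passes, n_tasks=6)
toy_unique = solved_by_exactly_one(toy_passes)
```

On the toy data, three of six tasks are covered by some model, and `t1` and `t3` are each solved by a single model. The coverage figure is an upper bound in the sense that it assumes a router that always picks a passing model when one exists.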