Existing benchmarks for AI reasoning provide limited insight into how closely these capabilities resemble human reasoning in naturalistic contexts. We present an adaptation of the Watson & Holmes detective tabletop game as a new benchmark that evaluates reasoning performance using incrementally presented narrative evidence, open-ended questions, and unconstrained language responses. To enable scalable and replicable evaluation, we developed an automated grading system and validated it against human assessors. AI model performance improved markedly over time: over nine months of 2025, it rose from the lower quartile of the human comparison group to approximately the top 5%. Around half of this improvement reflects steady advancement across successive model releases, while the remainder corresponds to a marked step change associated with reasoning-oriented model architectures. Systematic differences between AI and human performance that depended on features of the specific detective puzzle were mostly absent, with two exceptions: model performance fell on longer cases (case lengths ranged from 1900 to 4000 words), and reasoning models held an advantage in inductive reasoning at early stages of case solving, when evidence was scant.