We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, running multiple trials per example with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
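To make the two evaluation mechanics concrete, the sketch below illustrates character-level name substitution and majority response selection over repeated trials. This is a minimal illustration, not the authors' released code: the helper names (`substitute_names`, `majority_answer`), the example mapping, and the `query_model` callable are all hypothetical.

```python
# Minimal sketch of (1) character-level name augmentation and
# (2) majority-vote answer selection across repeated model trials.
# All identifiers here are illustrative assumptions, not the paper's API.
from collections import Counter
from typing import Callable, Dict


def substitute_names(story: str, mapping: Dict[str, str]) -> str:
    """Replace each original character name with its augmented counterpart,
    e.g. {"Hercule Poirot": "Sherlock Holmes"} for a well-known-entity swap."""
    for original, replacement in mapping.items():
        story = story.replace(original, replacement)
    return story


def majority_answer(story: str, prompt: str,
                    query_model: Callable[[str], str],
                    n_trials: int = 5) -> str:
    """Query the model n_trials times on the same input and keep the
    most frequent answer, reducing variance from sampling."""
    answers = [query_model(prompt + "\n\n" + story) for _ in range(n_trials)]
    return Counter(answers).most_common(1)[0][0]
```

Majority selection over several trials is what lets accuracy differences between augmentation types be attributed to the substitution itself rather than to single-sample noise.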