We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, running multiple trials per example with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
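To make the two evaluation mechanics concrete, the sketch below illustrates character-level name substitution and majority response selection over repeated trials. This is a minimal illustration, not the authors' released code: the helper names (`substitute_names`, `majority_answer`), the example mapping, and the `query_model` callable are all hypothetical.

```python
# Minimal sketch of (1) character-level name augmentation and
# (2) majority-vote answer selection across repeated model trials.
# All identifiers here are illustrative assumptions, not the paper's API.
from collections import Counter
from typing import Callable, Dict


def substitute_names(story: str, mapping: Dict[str, str]) -> str:
    """Replace each original character name with its augmented counterpart,
    e.g. {"Hercule Poirot": "Sherlock Holmes"} for a well-known-entity swap."""
    for original, replacement in mapping.items():
        story = story.replace(original, replacement)
    return story


def majority_answer(story: str, prompt: str,
                    query_model: Callable[[str], str],
                    n_trials: int = 5) -> str:
    """Query the model n_trials times on the same input and keep the
    most frequent answer, reducing variance from sampling."""
    answers = [query_model(prompt + "\n\n" + story) for _ in range(n_trials)]
    return Counter(answers).most_common(1)[0][0]
```

Majority selection over several trials is what lets accuracy differences between augmentation types be attributed to the substitution itself rather than to single-sample noise.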