While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. Evaluating LLM reasoning is itself challenging, because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and that can be scaled further as more capable LLMs are released. Second, our dataset instances are free-text narratives corresponding to real-world domains of reasoning; this makes the dataset much more challenging than other synthetically crafted benchmarks while keeping it realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.