The capability to reason from text is crucial for real-world NLP applications. Real-world scenarios often involve incomplete or evolving data, and individuals update their beliefs and understanding accordingly. However, most existing evaluations assume that language models (LMs) operate with consistent information. We introduce Belief-R, a new dataset designed to test LMs' belief-revision ability when presented with new evidence. Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning ($\Delta R$) framework. Belief-R features sequences of premises designed to simulate scenarios where additional information could necessitate the revision of conclusions previously drawn by LMs. We evaluate $\sim$30 LMs across diverse prompting strategies and find that they generally struggle to appropriately revise their beliefs in response to new information. Moreover, models adept at updating often underperform in scenarios where no update is needed, highlighting a critical trade-off. These insights underscore the importance of improving LMs' adaptiveness to changing information, a step toward more reliable AI systems.
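To make the $\Delta R$ setup concrete, the sketch below illustrates what a two-step belief-revision evaluation instance might look like. This is an illustration only, not the paper's implementation: the `DeltaRExample` structure, its field names, and the `evaluate` helper are hypothetical, and the premises follow the classic suppression-task pattern the abstract alludes to rather than an actual Belief-R item.

```python
# Illustrative sketch of a Delta-R style belief-revision check.
# All names here are hypothetical; the protocol (query before and
# after an additional premise) mirrors the setup described in the
# abstract, not the paper's exact code.
from dataclasses import dataclass

@dataclass
class DeltaRExample:
    premises: list[str]     # initial premise sequence
    initial_answer: str     # conclusion expected from the premises alone
    new_premise: str        # additional information revealed later
    revised_answer: str     # conclusion expected after the update

example = DeltaRExample(
    premises=[
        "If she has an essay to write, she will study late in the library.",
        "She has an essay to write.",
    ],
    initial_answer="She will study late in the library.",
    new_premise="The library is closed.",
    revised_answer="It is uncertain whether she will study late in the library.",
)

def evaluate(model, ex: DeltaRExample) -> dict:
    """Query a model before and after the new premise and score both steps.

    `model` is any callable mapping a list of premises to an answer string.
    A model must both draw the initial inference and revise it once the
    defeating premise arrives.
    """
    before = model(ex.premises)
    after = model(ex.premises + [ex.new_premise])
    return {
        "correct_before_update": before == ex.initial_answer,
        "correct_after_update": after == ex.revised_answer,
    }

# A stub that always repeats the naive modus ponens conclusion scores
# correct on the first step but fails to revise on the second, the
# failure mode the abstract reports.
naive = lambda premises: "She will study late in the library."
print(evaluate(naive, example))
# {'correct_before_update': True, 'correct_after_update': False}
```

Scoring the two steps separately also exposes the trade-off noted above: a model that revises too eagerly would pass the update step while failing held-out instances where no revision is warranted.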