Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused exclusively either on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn oneself by touching the pan), even though these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning about event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind human performance at .87 F1. We boost model performance to .59 F1 by representing events in a programming-language-like form and prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost performance to .67 F1. Our findings indicate not only the challenge that CREPE poses for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multihop event reasoning.
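To make the idea of code-like prompting with intermediate entity states concrete, the following is a minimal illustrative sketch and not the prompt format used in our experiments. It assumes a procedure can be rendered as a Python-like class with one method per step, with entity state changes written as attribute assignments before an open assignment that the model is asked to complete. The names `Step`, `snake`, `build_code_prompt`, and `event_change`, as well as the label strings and the example procedure, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    text: str
    # Entity state changes caused by this step: (entity, attribute, value).
    state_changes: List[Tuple[str, str, str]] = field(default_factory=list)
    # How the step changes the event's likelihood; None leaves it for the model.
    label: Optional[str] = None


def snake(text: str) -> str:
    """Turn free text into a snake_case identifier."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return "_".join(cleaned.split())


def build_code_prompt(goal: str, event: str, steps: List[Step],
                      with_entity_states: bool = True) -> str:
    """Render a procedure as Python-like text ending in an open assignment."""
    class_name = "".join(word.capitalize() for word in snake(goal).split("_"))
    lines = [f"class {class_name}:", f"    # Event: {event}"]
    for i, step in enumerate(steps, start=1):
        lines.append(f"    def step{i}_{snake(step.text)}(self):")
        if with_entity_states:
            # Intermediate reasoning: entity states written out before the answer.
            for entity, attribute, value in step.state_changes:
                lines.append(
                    f"        self.{snake(entity)}.{snake(attribute)} = {value!r}"
                )
        if step.label is None:
            # Stop at the first unlabeled step so the model completes this line.
            lines.append("        self.event_change =")
            break
        lines.append(f"        self.event_change = {step.label!r}")
    return "\n".join(lines)


# Hypothetical example built around the pan scenario mentioned above.
steps = [
    Step("Put the pan on the stove", [("pan", "location", "on stove")],
         "equally likely"),
    Step("Turn on the burner", [("pan", "temperature", "hot")], None),
]
prompt = build_code_prompt(
    "Fry an egg", "You burn yourself by touching the pan", steps
)
print(prompt)
```

Passing `with_entity_states=False` drops the attribute assignments and yields a plain code-like prompt, which loosely corresponds to the contrast between code-like prompting alone and code-like prompting with entity states as intermediate reasoning steps.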