Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, \textsc{StoryAnalogy}, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on \textsc{StoryAnalogy}, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are extremely difficult not only for sentence embedding models but also for recent large language models (LLMs) such as ChatGPT and LLaMA. ChatGPT, for example, achieves only around 30% accuracy on multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in \textsc{StoryAnalogy} can improve the quality of analogy generation in LLMs: a fine-tuned FlanT5-xxl model achieves performance comparable to zero-shot ChatGPT.