Analogy-making between narratives is one of the most critical abilities in natural language understanding. In this paper, we evaluate the ability to identify and generate analogies by building a first-of-its-kind large-scale story-level analogy corpus, StoryAnalogy, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on StoryAnalogy, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are extremely challenging not only for sentence embedding models but also for recent large language models (LLMs) such as ChatGPT and LLaMa: ChatGPT achieves only around 30% accuracy on multiple-choice questions, compared with over 85% for humans. Finally, we find that data in StoryAnalogy can improve the quality of analogy generation in LLMs: a fine-tuned FlanT5-xxl model yields performance comparable to zero-shot ChatGPT.