We propose to use image captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same "message") and to create and analyze a corresponding dataset. When an image is reused on the Web, it is often assigned a new, original caption. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The paper introduces the underlying mining technology and compares our new resource with known paraphrase corpora with respect to syntactic and semantic paraphrase similarity. In this context, we introduce characteristic maps along the two similarity dimensions to identify the style of paraphrases coming from different sources. An annotation study demonstrates the high reliability of the algorithmically determined characteristic maps.
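
The core hypothesis, that different captions of the same image form a set of mutual paraphrases, can be illustrated with a minimal sketch. This is not the mining pipeline introduced in the paper. The function name `mine_caption_paraphrases`, the `(image_id, caption)` input format, and the whitespace normalization are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_paraphrases(records):
    """Pair up distinct captions that were assigned to the same image.

    `records` is an iterable of (image_id, caption) tuples. This input
    format is a hypothetical one chosen for the example.
    """
    captions_by_image = defaultdict(list)
    for image_id, caption in records:
        # Collapse runs of whitespace so trivial variants count as equal.
        normalized = " ".join(caption.split())
        # Skip empty captions and exact repeats for the same image.
        if normalized and normalized not in captions_by_image[image_id]:
            captions_by_image[image_id].append(normalized)

    pairs = []
    for image_id, captions in captions_by_image.items():
        # Every pair of distinct captions is a paraphrase candidate.
        for first, second in combinations(captions, 2):
            pairs.append((image_id, first, second))
    return pairs
```

An image with n distinct captions yields n(n-1)/2 candidate pairs, and an image with a single caption yields none. Whether a candidate pair is an acceptable paraphrase would still have to be assessed, for example along the syntactic and semantic similarity dimensions mentioned above.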