Algorithmic sequence alignment identifies similar segments shared between pairs of documents and is fundamental to many NLP tasks. However, it is difficult to recognize similarities between distant versions of a narrative, such as translations and retellings, and particularly difficult for summaries and abridgements, which are much shorter than the original novels. We develop a general approach to narrative alignment that couples the Smith-Waterman algorithm from bioinformatics with modern text similarity metrics. We show that the background distribution of alignment scores fits a Gumbel distribution, enabling us to define rigorous p-values for the significance of any alignment. We apply and evaluate our general narrative alignment tool (GNAT) on four distinct problem domains that differ greatly in both the relative and absolute lengths of the documents: summary-to-book alignment, translated book alignment, short story alignment, and plagiarism detection, demonstrating the power and performance of our methods.
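The two ingredients named above, local alignment driven by a text similarity score and a Gumbel model of background scores, can be illustrated with a minimal sketch. This is not the GNAT implementation. The `jaccard` similarity is a simple stand-in for a modern text similarity metric, and the `baseline` and `gap` values, the method-of-moments fit in `fit_gumbel`, and all function names are assumptions made for illustration only.

```python
import math
import statistics


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for an embedding-based metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def smith_waterman(xs, ys, sim=jaccard, baseline=0.3, gap=0.2):
    """Best local alignment of two lists of text units (e.g. sentences).

    A pair of units scores sim - baseline, so dissimilar pairs are penalized
    (baseline and gap are assumed values, not taken from the paper).
    Returns (best_score, list of aligned (i, j) index pairs).
    """
    n, m = len(xs), len(ys)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = H[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]) - baseline
            # The floor at 0 is what makes the alignment local.
            H[i][j] = max(0.0, match, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        match = H[i - 1][j - 1] + sim(xs[i - 1], ys[j - 1]) - baseline
        if math.isclose(H[i][j], match):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


def fit_gumbel(background_scores):
    """Method-of-moments fit of a Gumbel (maximum) distribution.

    Assumes the scores are not all identical (otherwise beta would be 0).
    Returns (mu, beta): location and scale.
    """
    euler_gamma = 0.5772156649015329
    beta = statistics.stdev(background_scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(background_scores) - euler_gamma * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(background score >= score) under the fitted Gumbel distribution."""
    z = max((score - mu) / beta, -50.0)  # clamp to avoid overflow in exp
    return -math.expm1(-math.exp(-z))   # equals 1 - exp(-exp(-z))
```

In use, `background_scores` would be the best alignment scores obtained from pairs of unrelated documents. After `mu, beta = fit_gumbel(background_scores)`, the call `gumbel_p_value(score, mu, beta)` estimates how likely a score at least that high is to arise by chance.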