Analyzing the pattern of semantic variation in long real-world texts such as books or transcripts is interesting from stylistic, cognitive, and linguistic perspectives. It is also useful for applications such as text segmentation, document summarization, and detection of semantic novelty. The recent emergence of several vector-space methods for sentence embedding has made such analysis feasible. However, this raises the question of how consistent and meaningful the semantic representations produced by the various methods are in themselves. In this paper, we compare several recent sentence embedding methods using two views of each document: time series of semantic similarity between successive sentences, and matrices of pairwise sentence similarity, computed for multiple works of literature. In contrast to previous work that compares sentence embedding methods using target tasks and curated datasets, our approach evaluates the methods 'in the wild'. We find that most of the sentence embedding methods considered infer highly correlated patterns of semantic similarity in a given document, but also show interesting differences.
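The two views of a document described in the abstract, a time series of similarity between successive sentences and a matrix of pairwise sentence similarity, can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the abstract names neither the similarity measure nor the correlation statistic, so cosine similarity and Pearson correlation are assumptions here, the function names are hypothetical, and the toy arrays merely stand in for the output of two real sentence embedding methods.

```python
import numpy as np


def normalize_rows(embeddings):
    """Scale each sentence vector to unit length (zero vectors stay zero)."""
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.where(norms == 0.0, 1.0, norms)


def successive_similarity(embeddings):
    """Cosine similarity between each sentence and the next (length n - 1)."""
    unit = normalize_rows(embeddings)
    return np.sum(unit[:-1] * unit[1:], axis=1)


def pairwise_similarity(embeddings):
    """n x n matrix of cosine similarities between all sentence pairs."""
    unit = normalize_rows(embeddings)
    return unit @ unit.T


def series_correlation(series_a, series_b):
    """Pearson correlation between two similarity time series (assumed metric)."""
    return float(np.corrcoef(series_a, series_b)[0, 1])


# Toy stand-ins for two embedding methods applied to the same 4 sentences.
# Real inputs would be (num_sentences, embedding_dim) arrays from each method.
method_a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])
method_b = method_a * 3.0  # same directions, different scale

series_a = successive_similarity(method_a)
series_b = successive_similarity(method_b)
matrix_a = pairwise_similarity(method_a)
agreement = series_correlation(series_a, series_b)
```

Because cosine similarity ignores vector length, the two toy methods yield identical time series despite their different scales. With real embeddings, the same comparison would indicate how closely two methods agree on where a document's meaning shifts.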