Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures $D$ using three representations of texts -- vocabularies, word frequency distributions, and vector embeddings -- and three simple tasks -- clustering texts by author, subject, and time period. Using the Project Gutenberg database, we found that the generalised Jensen--Shannon divergence applied to word frequencies performed strongly across all tasks, that measures $D$ based on vector embedding representations performed better on smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different measures $D$ when the two texts differed in length by a factor $h$. We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent, and we computed explicitly the $h$-dependence of the bias of the estimator of the generalised Jensen--Shannon divergence applied to word frequencies. We also found numerically that the Jensen--Shannon divergence and embedding-based approaches were robust to changes in $h$, while the Jaccard distance was not.
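As an illustration of two of the measures named above, the following is a minimal Python sketch of the Jaccard distance between vocabularies and the standard Jensen--Shannon divergence between word frequency distributions. It is not the implementation used in the paper: the whitespace tokeniser, the function names, the equal weighting of the two texts, and the use of base-2 logarithms are all assumptions made for this example, and the generalised form of the divergence is not defined in this abstract and is not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase/whitespace tokenisation (an assumption; the abstract
    # does not specify any preprocessing).
    return text.lower().split()


def jaccard_distance(tokens_a, tokens_b):
    # Plug-in estimate: 1 - |A intersect B| / |A union B| over observed vocabularies.
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 1.0 - len(vocab_a & vocab_b) / len(union)


def shannon_entropy(probs):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def jensen_shannon_divergence(tokens_a, tokens_b):
    # Standard Jensen-Shannon divergence with equal weights 1/2 (assumed):
    # H((p + q) / 2) - (H(p) + H(q)) / 2, using relative word frequencies.
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    p = [counts_a[w] / n_a for w in vocab]
    q = [counts_b[w] / n_b for w in vocab]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return shannon_entropy(m) - (shannon_entropy(p) + shannon_entropy(q)) / 2


if __name__ == "__main__":
    text_a = tokenize("the cat sat on the mat")
    text_b = tokenize("the dog sat on the log")
    print(jaccard_distance(text_a, text_b))
    print(jensen_shannon_divergence(text_a, text_b))
```

With these definitions both measures return 0 for identical token lists and 1 for token lists with no words in common (the divergence being measured in bits). Both functions work on the observed tokens only, so they are plug-in estimates whose values can depend on text length, which is the effect the abstract examines through the factor $h$.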