We present a solution to a semantic textual similarity (STS) problem in which two sentences must be matched when the only distinguishing factor is highly specific information (such as names, addresses, or identification codes), and from which a definition of when two sentences are similar and when they are not must be derived. The solution uses a neural network based on the siamese architecture to produce the distributions of the distances between similar and dissimilar pairs of sentences. From these distributions we derive a discriminating factor, which we call the "threshold": a well-defined quantity that separates the vector distances of similar pairs from those of dissimilar pairs in new predictions and later analyses. In addition, we develop a way to score predictions by combining features of the distributions with properties of the distance function. Finally, we show that the results transfer to a wider range of domains by applying the system to a well-known and widely used benchmark dataset for STS problems.
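To make the role of the threshold concrete, the following is a minimal sketch of how a cut-off could be read off two samples of pair distances. It is an illustration under stated assumptions, not the procedure used in this work: the selection criterion (maximal balanced accuracy), the function names `find_threshold` and `is_similar`, and the toy distances are all hypothetical, and the distances are assumed to have been computed already by a siamese encoder.

```python
import numpy as np


def find_threshold(similar_dist, dissimilar_dist):
    """Return (threshold, balanced_accuracy) for two samples of distances.

    Assumption: the threshold is the observed distance that maximizes
    balanced accuracy. The abstract does not state the actual criterion.
    """
    similar = np.asarray(similar_dist, dtype=float)
    dissimilar = np.asarray(dissimilar_dist, dtype=float)

    # Candidate cut-offs: every distinct distance seen in either sample.
    candidates = np.unique(np.concatenate([similar, dissimilar]))

    best_t, best_score = float(candidates[0]), -1.0
    for t in candidates:
        tpr = np.mean(similar <= t)    # similar pairs at or below the cut-off
        tnr = np.mean(dissimilar > t)  # dissimilar pairs above the cut-off
        score = 0.5 * (tpr + tnr)
        if score > best_score:         # strict '>' keeps the smallest best cut-off
            best_t, best_score = float(t), float(score)
    return best_t, best_score


def is_similar(distance, threshold):
    """Classify a new pair: distances at or below the threshold count as similar."""
    return distance <= threshold


# Toy, made-up distances for illustration only.
similar_pairs = [0.1, 0.2, 0.3]
dissimilar_pairs = [0.7, 0.8, 0.9]
threshold, accuracy = find_threshold(similar_pairs, dissimilar_pairs)
print(threshold, accuracy, is_similar(0.25, threshold))
```

When the two distributions overlap, no cut-off separates them perfectly, and the balanced-accuracy criterion assumed here is only one of several reasonable ways to trade false matches against missed matches.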