Semantic representations of text, i.e. representations of natural language that capture meaning through geometry, are essential for areas such as information retrieval and document grouping. High-dimensional trained dense vectors have received much attention in recent years as such representations. We investigate the structure of the semantic spaces that arise from embeddings made with Sentence-BERT and find that the representations suffer from a well-known problem in high dimensions called hubness. Hubness results in asymmetric neighbourhood relations, such that some texts (the hubs) are neighbours of many other texts, while most texts (the so-called anti-hubs) are neighbours of few or no other texts. We quantify the semantic quality of the embeddings using hubness scores and the error rate of a neighbourhood-based classifier. We find that when hubness is high, hubness reduction methods reduce both the error rate and hubness. We identify a combination of two methods as giving the best reduction. For example, on one of the tested pretrained models, this combined method reduces hubness by about 75% and error rate by about 9%. Thus, we argue that mitigating hubness in the embedding space provides better semantic representations of text.
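To make the notion of a hubness score concrete, the following is a minimal sketch of one commonly used measure: the skewness of the k-occurrence distribution, where the k-occurrence of a text is the number of other texts that have it among their k nearest neighbours. This is an illustrative assumption, not necessarily the exact score used in this work; the function names `k_occurrence` and `skewness`, the use of cosine similarity, the choice k = 10, and the random stand-in data are all hypothetical.

```python
import numpy as np

def k_occurrence(X, k=10):
    """Count how often each row of X appears in the k-nearest-neighbour
    lists of the other rows (cosine similarity assumed)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sim = Xn @ Xn.T                                    # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # a point is not its own neighbour
    # Indices of the k most similar points for every row (order within the k is arbitrary)
    knn = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=X.shape[0])

def skewness(counts):
    """Third standardised moment; large positive values indicate strong hubs."""
    c = np.asarray(counts, dtype=float)
    return ((c - c.mean()) ** 3).mean() / c.std() ** 3

# Random vectors standing in for sentence embeddings (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))

n_k = k_occurrence(X, k=10)
print("k-occurrence skewness:", skewness(n_k))
print("fraction of anti-hubs (never a neighbour):", (n_k == 0).mean())
```

A skewness near zero means neighbourhood relations are roughly symmetric; a long right tail in the k-occurrence counts corresponds to a few hubs and many anti-hubs.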