Ambiguity is ubiquitous in natural language, and resolving ambiguous meanings is especially important in information retrieval tasks. While word embeddings carry semantic information, they do not handle ambiguity well. Transformer models have been shown to handle word ambiguity in complex queries, but they cannot be used to identify ambiguous words, e.g., in a single-word query. Furthermore, training these models is costly in terms of time, hardware resources, and training data, which prohibits their use in specialized environments with sensitive data. Word embeddings, by contrast, can be trained using moderate hardware resources. This paper shows that applying DBSCAN clustering to the latent space can identify ambiguous words and evaluate their level of ambiguity. Automatic DBSCAN parameter selection yields high-quality clusters that are semantically coherent and correspond well to the perceived meanings of a given word.
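The idea of clustering the latent space to expose word senses can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the toy 2-D vectors standing in for the embedding neighbors of a query word such as "bank", the choice of cosine distance, the hand-picked `eps` and `min_samples` values (the automatic parameter selection mentioned in the abstract is not reproduced), and the use of the cluster count as a rough ambiguity score.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dbscan(points, eps, min_samples):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Hypothetical toy vectors for the nearest neighbors of an ambiguous word.
# Real embeddings would have hundreds of dimensions.
neighbor_vectors = {
    "money":   (1.00, 0.05),
    "loan":    (0.95, 0.10),
    "account": (0.98, 0.02),
    "river":   (0.05, 1.00),
    "shore":   (0.10, 0.95),
    "water":   (0.02, 0.98),
    "banana":  (0.70, 0.70),
}

words = list(neighbor_vectors)
# eps and min_samples are hand-picked assumptions for this toy data.
labels = dbscan([neighbor_vectors[w] for w in words], eps=0.05, min_samples=3)

# Assumed ambiguity score: number of distinct non-noise clusters.
n_senses = len(set(labels) - {-1})
for word, label in zip(words, labels):
    print(f"{word:8s} cluster={label}")
print("estimated number of senses:", n_senses)
```

In this sketch a word whose neighbors fall into more than one cluster would be flagged as ambiguous, and a word with a single cluster as unambiguous. How the paper selects the points to cluster and scores the level of ambiguity may differ.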