Domain-specific synonyms arise in many specialized search tasks, such as searching medical documents, legal documents, and software engineering artifacts. We replicate prior work on ranking domain-specific synonyms in the consumer health domain by applying the approach to a new language and domain: identifying Swedish-language synonyms in the building construction domain. We chose this setting because identifying synonyms in this domain helps downstream systems in which different users may query for documents (e.g., engineering requirements) using different terminology. We consider two new features, motivated by the change in language and by methodological advances since the prior work's publication. An evaluation using data from the building construction domain supports the prior work's finding that synonym discovery is best approached as a learning-to-rank task in which a human editor reviews ranked synonym candidates in order to construct a domain-specific thesaurus. We additionally find that FastText embeddings alone provide a strong baseline, though they do not perform as well as the strongest learning-to-rank method. Finally, we analyze the performance of individual features and the differences between the two domains.
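To make the embedding-only baseline concrete, the following minimal sketch ranks synonym candidates for a query term by cosine similarity between word vectors, producing the kind of ordered list a human editor would review. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names `cosine` and `rank_candidates` are hypothetical, and `TOY_EMBEDDINGS` contains hand-made three-dimensional vectors that stand in for real FastText vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, embeddings, top_k=None):
    """Return (term, score) pairs for every term except `query`, best match first."""
    query_vec = embeddings[query]
    scored = [
        (term, cosine(query_vec, vec))
        for term, vec in embeddings.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# ASSUMPTION: toy vectors invented for illustration only. Real vectors would
# come from an embedding model trained on domain text.
TOY_EMBEDDINGS = {
    "fönster": [0.9, 0.1, 0.0],
    "glasparti": [0.8, 0.2, 0.1],
    "dörr": [0.3, 0.9, 0.1],
    "betong": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for term, score in rank_candidates("fönster", TOY_EMBEDDINGS):
        print(f"{term}\t{score:.3f}")
```

A learning-to-rank method would use a similarity score of this kind as one feature among several rather than as the sole ranking criterion.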