Lexical semantics is concerned both with the multiple senses a word can adopt in different contexts and with the semantic relations that hold between the meanings of different words. Contextualized language models are a valuable tool for investigating these phenomena, as they provide context-sensitive representations that can be used to study lexical meaning. Recent work such as XL-LEXEME has leveraged the Word-in-Context task to fine-tune such models toward more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of information captured. In this paper, we propose an extension of this task, Concept Differentiation, that covers inter-word scenarios. We provide a dataset for this task, derived from SemCor data, and fine-tune several representation models on it. We call the resulting models Concept-Aligned Embeddings (CALE). By evaluating our models against others on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that achieve the best performance in our experiments. We also show that CALE's fine-tuning brings valuable changes to the spatial organization of the embedding space.