Pre-trained contextual language models are widely used for language understanding tasks, but they are unsuitable for resource-constrained systems. Noncontextual word embeddings are an efficient alternative in these settings. However, such methods typically encode all the different meanings of a word in a single vector, and so incur errors due to polysemy. This paper proposes a two-stage method to distill multiple word senses from a pre-trained language model (BERT) by using attention over the senses of a word in context and transferring this sense information to fit multi-sense embeddings in a skip-gram-like framework. We demonstrate an effective approach to training the sense disambiguation mechanism in our model with a distribution over word senses extracted from the output-layer embeddings of BERT. Experiments on contextual word similarity and sense induction tasks show that this method is superior to or competitive with state-of-the-art multi-sense embeddings on multiple benchmark data sets, and experiments with an embedding-based topic model (ETM) demonstrate the benefits of using this multi-sense embedding in a downstream application.
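To make the idea of attention over the senses of a word in context concrete, the following is a minimal sketch and not the model described in this paper. It assumes that each word has a small fixed set of sense vectors, that the context is summarized by the mean of its word vectors, and that attention scores are plain dot products; the function names (`softmax`, `dot`, `sense_attention`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sense_attention(sense_vectors, context_vectors):
    """Weight each sense of a word by its fit to the context.

    Assumption (not from the paper): the context is represented by the
    mean of its word vectors and scored against each sense by dot product.
    Returns the attention weights and the weighted mix of sense vectors.
    """
    dim = len(sense_vectors[0])
    n = len(context_vectors)
    context_mean = [sum(c[i] for c in context_vectors) / n for i in range(dim)]
    weights = softmax([dot(s, context_mean) for s in sense_vectors])
    mixed = [
        sum(w * s[i] for w, s in zip(weights, sense_vectors))
        for i in range(dim)
    ]
    return weights, mixed


# Hypothetical toy data: two senses of one word, and a two-word context
# that lies closer to the first sense.
senses = [[1.0, 0.0], [0.0, 1.0]]
context = [[0.9, 0.1], [0.8, 0.0]]
weights, mixed = sense_attention(senses, context)
print(weights)  # the first sense receives the larger weight
```

In the setting the abstract describes, weights of this kind would be trained to match a distribution over word senses extracted from BERT. That training step and the skip-gram-like objective are not shown here.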