Distributed representations of words encode lexical semantic information, but what type of information is encoded, and how? Focusing on the skip-gram with negative-sampling method, we find that the squared norm of a static word embedding encodes the information gain conveyed by the word, where the information gain is defined as the Kullback-Leibler (KL) divergence from the word's co-occurrence distribution to the unigram distribution. Our findings are explained by the theoretical framework of the exponential family of probability distributions and confirmed through precise experiments that remove spurious correlations arising from word frequency. The theory also extends to contextualized word embeddings in language models and to any neural network with a softmax output layer. We also demonstrate that both the KL divergence and the squared norm of the embedding provide a useful metric of the informativeness of a word in tasks such as keyword extraction, proper-noun discrimination, and hypernym discrimination.
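The information gain described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `information_gain`, the toy count matrix, and the choice to estimate the unigram distribution from the column totals of the co-occurrence matrix are all assumptions made for illustration. For each word w, the sketch computes KL(p(·|w) ‖ p(·)), where p(·|w) is the distribution of context words co-occurring with w and p(·) is the unigram distribution.

```python
import numpy as np


def information_gain(cooc):
    """Return KL(p(.|w) || p(.)) in nats for each row word w.

    cooc[i, j] is a (hypothetical) count of context word j occurring
    near target word i. Every row is assumed to have at least one count.
    """
    cooc = np.asarray(cooc, dtype=float)
    # Assumption: the unigram distribution is estimated from column totals.
    unigram = cooc.sum(axis=0) / cooc.sum()
    # Co-occurrence distribution of context words given each target word.
    cond = cooc / cooc.sum(axis=1, keepdims=True)
    # Use the convention 0 * log 0 = 0: the ratio is set to 1 where cond == 0.
    ratio = np.divide(cond, unigram, out=np.ones_like(cond), where=cond > 0)
    return (cond * np.log(ratio)).sum(axis=1)


# Toy counts: row 0 co-occurs with both contexts equally (uninformative),
# rows 1 and 2 each co-occur with a single context (informative).
counts = np.array([[2, 2],
                   [4, 0],
                   [0, 4]])
gain = information_gain(counts)
print(gain)
```

In this toy case the first word's co-occurrence distribution matches the unigram distribution, so its information gain is zero, while the other two words each gain log 2 nats. The abstract's claim is that, for skip-gram with negative sampling, the squared norm of a word's embedding tracks this quantity.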