When deriving contextualized word representations from language models, one must decide how to obtain a representation for out-of-vocabulary (OOV) words, which are segmented into subwords. What is the best way to represent these words with a single vector, and are these representations of worse quality than those of in-vocabulary words? We carry out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words. Our analysis reveals, among other findings, that the representations of split words are often, but not always, of worse quality than the embeddings of known words. The similarity values obtained for split words, however, must be interpreted with caution.
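To make the question of a single-vector representation concrete, the sketch below shows three common ways of collapsing a word's subword vectors into one vector (first subword, last subword, and the mean), followed by a cosine similarity comparison. The function names `pool_subwords` and `cosine` and the toy 3-dimensional vectors are assumptions made for illustration. They are not taken from the evaluation described above, and the strategies shown are not necessarily the ones it compares.

```python
import math


def pool_subwords(subword_vectors, strategy="mean"):
    """Collapse the vectors of a word's subwords into a single vector.

    Hypothetical helper: `strategy` is one of "first", "last", or "mean".
    """
    if not subword_vectors:
        raise ValueError("need at least one subword vector")
    if strategy == "first":
        return list(subword_vectors[0])
    if strategy == "last":
        return list(subword_vectors[-1])
    if strategy == "mean":
        n = len(subword_vectors)
        # zip(*...) groups the values of each dimension across subwords
        return [sum(dim) / n for dim in zip(*subword_vectors)]
    raise ValueError(f"unknown strategy: {strategy}")


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration, not real model output.
split_word = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]  # a word split into two subwords
known_word = [[2.0, 1.0, 1.0]]                   # a word kept as a single token

for strategy in ("first", "last", "mean"):
    vec = pool_subwords(split_word, strategy)
    sim = cosine(vec, pool_subwords(known_word, strategy))
    print(strategy, vec, round(sim, 4))
```

Even in this toy setting the similarity to the known word changes with the pooling strategy, which is one reason to read similarity values for split words carefully.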