We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer and refine them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging the word embeddings, which incurs minimal computational cost. We evaluate our models on both monolingual and cross-lingual tasks and show that they substantially outperform existing static models on sentence semantic tasks, and even surpass a basic Sentence Transformer model (SimCSE) on a text embedding benchmark. Finally, we perform a variety of analyses and show that our method successfully removes word embedding components that are not highly relevant to sentence semantics, and adjusts the vector norms according to each word's influence on sentence semantics.
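To make the inference step concrete, the sketch below mean-pools static word vectors into a sentence embedding and compares two sentences by cosine similarity. It is a minimal illustration only: the toy vocabulary, the random `embeddings` table, and the whitespace tokenizer are hypothetical placeholders, standing in for the distilled vectors and tokenization the paper would actually use.

```python
import numpy as np

# Hypothetical static embedding table: word -> 300-d vector.
# In the paper's setting these would be the refined vectors
# extracted from a pre-trained Sentence Transformer.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
embeddings = {w: rng.standard_normal(300) for w in vocab}

def sentence_embedding(sentence: str) -> np.ndarray:
    """Represent a sentence as the average of its word vectors."""
    tokens = [t for t in sentence.lower().split() if t in embeddings]
    if not tokens:
        return np.zeros(300)
    return np.mean([embeddings[t] for t in tokens], axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Sentence similarity reduces to one table lookup per word plus an average,
# which is why inference with static embeddings is so cheap.
s1 = sentence_embedding("the cat sat on the mat")
s2 = sentence_embedding("the cat sat")
print(cosine(s1, s2))
```

Because mean pooling weights each word by its vector norm, the norm adjustment described above directly controls how much each word contributes to the pooled sentence representation.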