Although contextualized embeddings from large-scale pre-trained models perform well on many tasks, traditional static embeddings (e.g., Skip-gram, Word2Vec) still play an important role in low-resource and lightweight settings because of their low computational cost, ease of deployment, and stability. In this paper, we aim to improve word embeddings in two ways: 1) we incorporate more contextual information from existing pre-trained models into the Skip-gram framework, an approach we call Context-to-Vec; 2) we propose a post-processing retrofitting method for static embeddings that is independent of training and uses prior synonym knowledge and a weighted vector distribution. On both intrinsic and extrinsic tasks, our methods outperform the baselines by a large margin.
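To make the idea of synonym-based retrofitting concrete, the following is a minimal sketch of a generic post-processing step that pulls each static vector toward a weighted average of its synonyms' vectors. It uses the well-known graph-based retrofitting update in the style of Faruqui et al. (2015), not necessarily the method proposed in this paper. The function names `retrofit` and `cosine`, the `alpha` parameter, and the toy vectors and synonym weights are all assumptions made for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def retrofit(
    embeddings: Dict[str, Vector],
    synonyms: Dict[str, Dict[str, float]],
    alpha: float = 1.0,
    iterations: int = 10,
) -> Dict[str, Vector]:
    """Generic synonym retrofitting (illustrative, not the paper's exact method).

    Each word vector is repeatedly replaced by a weighted average of its
    original vector (weight `alpha`) and the current vectors of its synonyms
    (weights taken from `synonyms[word][synonym]`).
    """
    # Work on a copy so the original embeddings are left untouched.
    fitted = {word: list(vec) for word, vec in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in synonyms.items():
            if word not in embeddings:
                continue
            # Keep only synonyms that have a vector and a positive weight.
            known = {s: w for s, w in neighbours.items() if s in fitted and w > 0}
            if not known:
                continue
            total_weight = alpha + sum(known.values())
            # Start from the original vector, scaled by alpha.
            updated = [alpha * x for x in embeddings[word]]
            for syn, weight in known.items():
                for k, value in enumerate(fitted[syn]):
                    updated[k] += weight * value
            fitted[word] = [x / total_weight for x in updated]
    return fitted


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data (hypothetical): "happy" and "glad" are listed as synonyms.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [-1.0, 0.0]}
lexicon = {"happy": {"glad": 1.0}, "glad": {"happy": 1.0}}
fitted = retrofit(vectors, lexicon)
```

After retrofitting, the vectors for "happy" and "glad" are closer in cosine similarity than before, while "table", which has no synonym entry, is unchanged. Because the step operates only on finished vectors and a synonym lexicon, it can be applied to any static embedding without retraining.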