Sentence embeddings produced by Pretrained Language Models (PLMs) have received wide attention from the NLP community due to their superior performance when representing texts in numerous downstream applications. However, their high dimensionality is problematic when large numbers of sentences must be represented on memory- or compute-constrained devices. As a solution, we evaluate unsupervised methods for reducing the dimensionality of sentence embeddings produced by PLMs. Our experimental results show that simple methods such as Principal Component Analysis (PCA) can reduce the dimensionality of sentence embeddings by almost $50\%$ without incurring a significant loss in performance across multiple downstream tasks. Surprisingly, for the sentence embeddings produced by some PLMs, reducing the dimensionality even improves performance over the original high-dimensional versions in some tasks.
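As a rough illustration of the setup described above, the following sketch fits PCA on a batch of sentence embeddings and projects them to half their original dimensionality. It uses scikit-learn's `PCA`; the corpus size and the 768-dimensional synthetic embeddings (standing in for the output of a PLM sentence encoder such as Sentence-BERT) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for PLM sentence embeddings: 1,000 sentences x 768 dims
# (768 matches BERT-base's hidden size; the values here are synthetic).
embeddings = rng.standard_normal((1000, 768)).astype(np.float32)

# Fit PCA without any labels (unsupervised) and keep half the dimensions,
# mirroring the roughly 50% reduction reported in the abstract.
pca = PCA(n_components=embeddings.shape[1] // 2)
reduced = pca.fit_transform(embeddings)

print(embeddings.shape, "->", reduced.shape)  # (1000, 768) -> (1000, 384)

# Unseen sentences are embedded by the PLM as usual, then projected
# with the already-fitted components: pca.transform(new_embeddings).
```

Because the projection is fitted once and reused, only the low-dimensional vectors need to be stored, which is where the memory savings on constrained devices come from.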