Dense retrievers powered by pretrained embeddings are widely used for document retrieval but struggle in specialized domains due to mismatches between the training and target domain distributions. Domain adaptation typically requires costly annotation of query-document pairs and model retraining. In this work, we revisit an overlooked alternative: applying PCA to domain embeddings to derive lower-dimensional representations that preserve domain-relevant features while discarding non-discriminative components. Though PCA is traditionally used for efficiency, we demonstrate that this simple embedding compression can effectively improve retrieval performance. Evaluated across 9 retrievers and 14 MTEB datasets, PCA applied solely to query embeddings improves NDCG@10 in 75.4% of model-dataset pairs, offering a simple and lightweight method for domain adaptation.
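A minimal sketch of one way this idea can be realized (an assumption for illustration, not the paper's exact procedure): fit PCA on the domain's document embeddings, then project each query onto the top-k principal subspace and reconstruct it back to the original dimension. This modifies only the query embeddings, discards variance outside the domain-relevant directions, and keeps dot-product scoring against unmodified document embeddings valid. All function names and the toy data below are hypothetical.

```python
import numpy as np

def pca_components(X, k):
    # Center the embeddings and take the top-k principal
    # directions via SVD (rows of Vt are the components).
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def compress_queries(Q, mu, V):
    # Project queries onto the top-k subspace and reconstruct,
    # dropping components assumed to be non-discriminative.
    return (Q - mu) @ V.T @ V + mu

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 32))   # toy document embeddings
queries = rng.normal(size=(5, 32))  # toy query embeddings

mu, V = pca_components(docs, k=8)
q_hat = compress_queries(queries, mu, V)

# Dot-product retrieval scores against the untouched documents.
scores = q_hat @ docs.T
```

Because the queries are reconstructed into the original embedding space, the document index needs no changes, which is what makes the approach lightweight to deploy.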