Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training-data stratification beyond source granularity: we use a pretrained text embedding model together with the classic k-means clustering algorithm to further partition the training data by semantic cluster within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach both to the Topic Aware Sampling (TAS) component of the TAS-B methodology and to the nearest-neighbor-based hard-negative mining component of the ANCE methodology, and we discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
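To make the stratification idea concrete, the sketch below illustrates one possible implementation of cluster-stratified minibatch construction: embed the training queries with a pretrained text embedding model, run k-means separately within each data source, and then draw each minibatch from a single (source, cluster) stratum. This is a minimal illustration, not the paper's exact pipeline; the dataset schema, the encoder name `all-MiniLM-L6-v2`, and parameters such as `n_clusters` and `batch_size` are assumptions for the example.

```python
# Minimal sketch of cluster-stratified minibatch construction (assumed schema:
# each example is a dict with "source", "query", and "passage" keys).
import random
from collections import defaultdict

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer


def build_strata(examples, n_clusters=32, model_name="all-MiniLM-L6-v2"):
    """Group examples into (source, k-means cluster) strata using query embeddings."""
    encoder = SentenceTransformer(model_name)

    # First stratify by source, mirroring single-source minibatch training.
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)

    # Then sub-stratify each source by semantic cluster of the query embeddings.
    strata = defaultdict(list)
    for source, items in by_source.items():
        embeddings = encoder.encode([ex["query"] for ex in items], convert_to_numpy=True)
        k = min(n_clusters, len(items))
        labels = KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)
        for ex, label in zip(items, labels):
            strata[(source, int(label))].append(ex)
    return strata


def iter_minibatches(strata, batch_size=64, seed=0):
    """Yield minibatches whose examples all come from one (source, cluster) stratum."""
    rng = random.Random(seed)
    keys = list(strata)
    rng.shuffle(keys)
    for key in keys:
        items = strata[key][:]
        rng.shuffle(items)
        for i in range(0, len(items), batch_size):
            yield items[i : i + batch_size]
```

Batches produced this way contain semantically related queries, so in-batch negatives within a stratum tend to be harder than those drawn from a mixed-source batch, which is the intuition connecting this setup to TAS-style topic sampling and ANCE-style hard-negative mining.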