Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

The advancement of transformer neural networks has significantly elevated the capabilities of sentence similarity models, particularly in creating effective vector representations of natural language inputs. However, these models face notable challenges in domain-specific contexts, especially in highly specialized scientific sub-fields. Traditional methods often struggle in this regime, either overgeneralizing similarities within a niche or being overly sensitive to minor differences, which results in inaccurate text classification and subpar vector representations. In an era where retrieval augmentation and search are increasingly crucial, precise and concise numerical representations are essential. In this paper, we target this issue by assembling niche datasets using co-citations as a similarity metric, focusing on biomedical domains. We employ two key strategies for fine-tuning state-of-the-art models: (1) Domain-specific Fine-Tuning, which tailors pretrained models to a single domain, and (2) Universal Applicability with Mixture of Experts (MoE), which adapts pretrained models with enforced routing for multiple domains simultaneously. Our training approach uses abstracts rather than full texts for faster training, and incorporates Multiple Negatives Ranking loss for efficient contrastive learning. Notably, our MoE variants, equipped with $N$ experts, achieve the efficacy of $N$ individual models, pointing toward versatile, One-Size-Fits-All transformer networks for various tasks. This methodology yields significant improvements in scientific text classification metrics and holds promise for enhancing vector database search and compilation.
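To make the contrastive objective named in the abstract concrete, the following is a minimal sketch of a Multiple Negatives Ranking style loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `mnr_loss`, the toy 2-D embeddings, and the `scale` value of 20 are all hypothetical choices made here. The idea is that each anchor (for example, one abstract) is paired with one positive (for example, a co-cited abstract), and every other positive in the batch serves as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over the batch.

    positives[i] is the target for anchors[i]; every other positives[j]
    is treated as an in-batch negative. `scale` is an assumed temperature.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): two anchors and two candidate pairings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
```

In this sketch the loss is small when each anchor is most similar to its own positive and large when the pairing is wrong, which is the signal that pulls co-cited abstracts together and pushes unrelated ones apart. Because negatives come from the rest of the batch, no explicit negative mining is needed.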
