Large Language Models (LLMs) adapted via contrastive learning excel at general-purpose representation learning but struggle in vertical domains such as chemistry and law, primarily because they lack domain-specific knowledge. This work identifies a core bottleneck: the prevailing ``LLM+CL'' paradigm performs semantic alignment but not knowledge acquisition, and therefore fails on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge through an Information Bottleneck-Constrained Generative Learning stage, which preserves the LLM's causal attention to maximize knowledge acquisition while compressing semantics into the representation. It then performs Generative-Refined Contrastive Learning on the compressed representations to align them. This design maintains architectural consistency across the two stages and resolves the objective conflict between generative and contrastive learning. Extensive experiments on medical, chemistry, and code retrieval tasks show that LBR significantly outperforms strong baselines. Our work establishes a new paradigm for building accurate and robust representations in vertical domains.
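One plausible instantiation of the two stages, sketched here only for concreteness (the abstract does not fix the exact objectives; $q_\phi$, $p(z)$, $\beta$, $\tau$, and $\mathrm{sim}$ are illustrative notation rather than the paper's formulation), pairs the standard next-token objective with an information-bottleneck penalty on a compressed latent $z$ in Stage 1, then applies InfoNCE over those latents in Stage 2:
\[
\mathcal{L}_{\text{stage1}} \;=\; \mathbb{E}_{x}\Big[-\textstyle\sum_{t}\log p_\theta(x_t \mid x_{<t})\Big] \;+\; \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big),
\]
\[
\mathcal{L}_{\text{stage2}} \;=\; -\log \frac{\exp\!\big(\mathrm{sim}(z_q, z_{d^{+}})/\tau\big)}{\sum_{j}\exp\!\big(\mathrm{sim}(z_q, z_{d_j})/\tau\big)},
\]
where the causal-LM term keeps the backbone's causal attention intact during knowledge injection, the KL term enforces compression of the pooled representation, and the contrastive term aligns a query latent $z_q$ with its positive document $z_{d^{+}}$ against in-batch negatives $z_{d_j}$.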